History log of /netbsd-current/sys/dev/nvmm/x86/nvmm_x86_svm.c
Revision	Date	Author	Comments
# 1.85	23-Feb-2023	riastradh	nvmm: Filter CR4 bits on x86 SVM (AMD). In particular, prohibit PKE, Protection Key Enable, which requires some additional management of CPU state by nvmm.
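The CR4 filtering described in 1.85 can be pictured as masking a guest-requested CR4 value against an allow-list. The following is only a sketch under assumptions, not the driver's code: the `sketch_` names and the contents of `SKETCH_CR4_ALLOWED` are hypothetical, and the log does not say whether the driver rejects or silently drops prohibited bits. Only the bit positions are architectural (PKE is CR4 bit 22).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions (x86). */
#define SKETCH_CR4_PAE     (UINT64_C(1) << 5)
#define SKETCH_CR4_PGE     (UINT64_C(1) << 7)
#define SKETCH_CR4_OSFXSR  (UINT64_C(1) << 9)
#define SKETCH_CR4_OSXSAVE (UINT64_C(1) << 18)
#define SKETCH_CR4_PKE     (UINT64_C(1) << 22)

/*
 * Hypothetical allow-list. The real list used by the driver is not
 * given in the log; the only documented fact is that PKE is excluded.
 */
#define SKETCH_CR4_ALLOWED \
	(SKETCH_CR4_PAE | SKETCH_CR4_PGE | SKETCH_CR4_OSFXSR | SKETCH_CR4_OSXSAVE)

/* True if the guest asked for any bit outside the allow-list. */
static bool
sketch_cr4_has_prohibited(uint64_t requested)
{
	return (requested & ~SKETCH_CR4_ALLOWED) != 0;
}

/* Assumption: prohibited bits are dropped rather than causing an error. */
static uint64_t
sketch_cr4_filter(uint64_t requested)
{
	uint64_t filtered = requested & SKETCH_CR4_ALLOWED;

	/* PKE must never reach the guest CR4 in this sketch. */
	assert((filtered & SKETCH_CR4_PKE) == 0);
	return filtered;
}
```

An allow-list (rather than a deny-list) has the same forward-compatibility property that revision 1.33 cites for CPUID bits: CR4 bits introduced by future CPUs stay hidden until someone opts in.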
Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.84	20-Aug-2022	riastradh	x86: Split most of pmap.h into pmap_private.h or vmparam.h. This way pmap.h only contains the MD definition of the MI pmap(9) API, which loads of things in the kernel rely on, so changing x86 pmap internals no longer requires recompiling the entire kernel every time. Callers needing these internals must now use machine/pmap_private.h. Note: This is not x86/pmap_private.h because it contains three parts: 1. CPU-specific (different for i386/amd64) definitions used by... 2. common definitions, including Xenisms like xpmap_ptetomach, further used by... 3. more CPU-specific inlines for pmap_pte_* operations So {amd64,i386}/pmap_private.h defines 1, includes x86/pmap_private.h for 2, and then defines 3. Maybe we should split that out into a new pmap_pte.h to reduce this trouble. No functional change intended, other than that some .c files must include machine/pmap_private.h when previously uvm/uvm_pmap.h polluted the namespace with pmap internals. Note: This migrates part of i386/pmap.h into i386/vmparam.h -- specifically the parts that are needed for several constants defined in vmparam.h: VM_MAXUSER_ADDRESS VM_MAX_ADDRESS VM_MAX_KERNEL_ADDRESS VM_MIN_KERNEL_ADDRESS Since i386 needs PDP_SIZE in vmparam.h, I added it there on amd64 too, just to keep things parallel.
Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.83	26-Mar-2021	reinoud	Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without signals. This introduces a new kernel and userland NVMM version indicating this support. Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.
# 1.82	24-Oct-2020	mgorny	branches: 1.82.2; 1.82.4; Issue 64-bit versions of XSAVE for 64-bit amd64 programs When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64 use the 64-suffixed variant in order to include the complete FIP/FDP registers in the x87 area. The difference between the two variants is that the FXSAVE64 (new) variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64), while the legacy FXSAVE variant uses split fields: 32-bit offset, 16-bit segment and 16-bit reserved field (union fp_addr.fa_32). The latter implies that the actual addresses are truncated to 32 bits which is insufficient in modern programs. The change is applied only to 64-bit programs on amd64. Plain i386 and compat32 continue using plain FXSAVE. Similarly, NVMM is not changed as I am not familiar with that code. This is a potentially breaking change. However, I don't think it likely to actually break anything because the data provided by the old variant were not meaningful (because of the truncated pointer).
# 1.81	08-Sep-2020	maxv	nvmm-x86: avoid hogging behavior observed recently When the FPU code got rewritten in NetBSD, the dependency on IPL_HIGH was eliminated, and I took _vcpu_guest_fpu_enter() out of the VCPU loop since there was no need to be in the splhigh window. Later, the code was switched to use the kernel FPU API, API that works at IPL_VM, not at IPL_NONE. These two changes mean that the whole VCPU loop is now executing at IPL_VM, which is not desired, because it introduces a delay in interrupt processing on the host in certain cases. Fix this by putting _vcpu_guest_fpu_enter() back inside the VCPU loop.
# 1.80	08-Sep-2020	maxv	nvmm: cosmetic changes - Style. - Explicitly include ioccom.h.
# 1.79	06-Sep-2020	riastradh	Fix fallout from previous uvm.h cleanup. - pmap(9) needs uvm/uvm_extern.h. - x86/pmap.h is not usable on its own; it is only usable if included via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h). - Make nvmm.h and nvmm_internal.h standalone.
# 1.78	05-Sep-2020	riastradh	Round of uvm.h cleanup. The poorly named uvm.h is generally supposed to be for uvm-internal users only. - Narrow it to files that actually need it -- mostly files that need to query whether curlwp is the pagedaemon, which should maybe be exposed by an external header. - Use uvm_extern.h where feasible and uvm_*.h for things not exposed by it. We should split up uvm_extern.h but this will serve for now to reduce the uvm.h dependencies. - Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use UVMHIST(ubchist), since ubchist is declared in uvm.h but the reference evaporates if UVMHIST is not defined, so we reduce header file dependencies. - Make uvm_device.h and uvm_swap.h independently includable while here. ok chs@
# 1.77	05-Sep-2020	maxv	x86: rename PGEX_X -> PGEX_I To match the x86 specification and the other OSes.
# 1.76	05-Sep-2020	maxv	nvmm: update copyright headers
# 1.75	04-Sep-2020	maxv	nvmm-x86-svm: check the SVM revision Only revision 1 exists, but check it, for future-proofness.
# 1.74	26-Aug-2020	maxv	nvmm-x86-svm: improve the handling of MSR_EFER Intercept reads of it as well, just to mask EFER_SVME, which the guest doesn't need to see.
# 1.73	26-Aug-2020	maxv	nvmm-x86: improve the handling of RFLAGS.RF - When injecting certain exceptions, set RF. For us to have an up-to-date view of RFLAGS, we commit the state before the event. - When advancing RIP, clear RF.
# 1.72	26-Aug-2020	maxv	nvmm-x86-svm: don't forget to intercept INVD INVD executed in the guest can be dangerous for the host, due to CPU caches being flushed without write-back.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:
          +---------------------------------------------+
          |                    Qemu                     |
          +---------------------------------------------+
               |                                   ^
               | (0) nvmm_vcpu_getstate            | (6) Done
               |                                   |
               V                                   |
             +---------------------------------------+
             |                libnvmm                |
             +---------------------------------------+
                  |   ^          |               ^
        (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
          cached? |   |          | "please cache |     fetched
                  |   |          |  the state"   |
                  V   |          |               |
              +-----------+      |               |
              | Comm Page |------+---------------+
              +-----------+      |
                   ^             |
      (4) "Alright |             V
           babe"   |     +--------+
                   +-----| Kernel |
                         +--------+
The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
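The wanted/cached/commit protocol from 1.43 can be modelled in plain userland C. This is a toy sketch under stated assumptions, not libnvmm or the driver: every `sketch_` name is hypothetical, the "kernel" is just a second array in the same struct, a counter stands in for real syscalls, and invalidating the cache after a run is an assumption (the commit says the real kernel also speculatively re-caches some states).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_STATE_GPRS 0x1u	/* hypothetical state-class flag */

struct sketch_comm_page {
	uint64_t wanted;	/* states the kernel must fetch on request */
	uint64_t cached;	/* states currently valid in the comm page */
	uint64_t commit;	/* states the kernel must set in vcpu_run */
	uint64_t gprs[4];	/* placeholder register file */
};

struct sketch_vcpu {
	struct sketch_comm_page comm;
	uint64_t kernel_gprs[4];	/* stand-in for kernel-side state */
	unsigned syscalls;		/* counts simulated kernel entries */
};

/* Simulated ioctl: the "kernel" copies wanted states into the comm page. */
static void
sketch_ioctl_getstate(struct sketch_vcpu *v)
{
	v->syscalls++;
	if (v->comm.wanted & SKETCH_STATE_GPRS)
		memcpy(v->comm.gprs, v->kernel_gprs, sizeof(v->comm.gprs));
	v->comm.cached |= v->comm.wanted;
	v->comm.wanted = 0;
}

/* getstate: no syscall when everything requested is already cached. */
static void
sketch_getstate(struct sketch_vcpu *v, uint64_t flags)
{
	uint64_t missing = flags & ~v->comm.cached;

	if (missing == 0)
		return;		/* the direct 1->5 path */
	v->comm.wanted = missing;
	sketch_ioctl_getstate(v);
}

/* setstate: never a syscall, just stage the state for the next run. */
static void
sketch_setstate(struct sketch_vcpu *v, const uint64_t gprs[4])
{
	memcpy(v->comm.gprs, gprs, sizeof(v->comm.gprs));
	v->comm.commit |= SKETCH_STATE_GPRS;
	v->comm.cached |= SKETCH_STATE_GPRS;
}

/* run: the one unavoidable kernel entry; commits staged state first. */
static void
sketch_run(struct sketch_vcpu *v)
{
	v->syscalls++;
	if (v->comm.commit & SKETCH_STATE_GPRS)
		memcpy(v->kernel_gprs, v->comm.gprs, sizeof(v->kernel_gprs));
	v->comm.commit = 0;
	v->comm.cached = 0;	/* assumption: guest ran, cache is stale */
}
```

In this model a setstate followed by a run costs one kernel entry instead of two, which is the kind of saving the commit message attributes to the comm page.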
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added to the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
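The flag word described in revision 1.51 can be read as a small decision table. The following is a minimal sketch, not the NVMM source: the struct name, the `leaf_action` helper, and the enum are hypothetical, and the rejection of reserved bits or of `mask` and `exit` set together is an assumption rather than something the log states.

```c
#include <stdint.h>

/* Flag word with the three fields named in the log entry. */
struct cpuid_conf_flags {
	uint32_t mask:1;	/* filter the leaf */
	uint32_t exit:1;	/* trigger a VMEXIT on the leaf */
	uint32_t rsvd:30;	/* reserved */
};

/* Hypothetical result type, used only for this illustration. */
enum leaf_action { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT, LEAF_INVALID };

static enum leaf_action
leaf_action(struct cpuid_conf_flags f)
{
	if (f.rsvd != 0)
		return LEAF_INVALID;	/* assumption: reserved bits must be zero */
	if (f.mask && f.exit)
		return LEAF_INVALID;	/* assumption: the two bits are exclusive */
	if (f.mask)
		return LEAF_MASK;
	if (f.exit)
		return LEAF_EXIT;
	return LEAF_DEFAULT;		/* both zero: reset to default behavior */
}
```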
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: (0) Qemu calls nvmm_vcpu_getstate on libnvmm; (1) libnvmm asks the comm page: state cached?; (2) no; (3) libnvmm issues an ioctl: "please cache the state"; (4) the kernel answers "Alright babe" and fills the comm page; (5) ok, state fetched by libnvmm from the comm page; (6) done, control returns to Qemu. The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
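The interplay of the "wanted", "cached" and "commit" flags from revision 1.43 can be illustrated with a toy model. This is a sketch under stated assumptions, not the libnvmm code: every name below (`comm_page`, `sketch_getstate`, `kernel_fill`, the `ST_*` bits) is hypothetical, the ioctl is replaced by a plain function that counts calls, and clearing `cached` after a run is an assumption about when cached copies go stale.

```c
#include <stdint.h>

/* Hypothetical state-group bits, for illustration only. */
#define ST_GPRS 0x1u
#define ST_SEGS 0x2u

struct comm_page {
	uint64_t wanted;	/* states the kernel must fetch on request */
	uint64_t cached;	/* states currently valid in the comm page */
	uint64_t commit;	/* states the kernel must set at the next run */
};

static unsigned nsyscalls;	/* counts simulated ioctls */

/* Stand-in for the ioctl: the "kernel" caches whatever is wanted. */
static void
kernel_fill(struct comm_page *c)
{
	nsyscalls++;
	c->cached |= c->wanted;
	c->wanted = 0;
}

/* Fast path: no syscall when everything requested is already cached. */
static void
sketch_getstate(struct comm_page *c, uint64_t flags)
{
	if ((c->cached & flags) == flags)
		return;
	c->wanted = flags & ~c->cached;
	kernel_fill(c);
}

/* Never a syscall: record the state and mark it for commit. */
static void
sketch_setstate(struct comm_page *c, uint64_t flags)
{
	c->cached |= flags;
	c->commit |= flags;
}

/* At run time the pending commits are applied by the kernel. */
static void
sketch_run(struct comm_page *c)
{
	c->commit = 0;
	c->cached = 0;	/* assumption: the guest ran, cached copies are stale */
}
```

In this model a second `sketch_getstate()` for the same state group costs nothing, which is the saving the log entry describes.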
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
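The per-VM generation scheme from revision 1.29 can be sketched with C11 atomics. This is an illustration only, not the kernel code: the struct and function names are hypothetical, the broadcast IPI is left out, and a counter stands in for the actual hardware flush.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU bookkeeping. */
struct sketch_vm {
	_Atomic uint64_t tlb_gen;	/* bumped on every flush request */
};

struct sketch_vcpu {
	uint64_t tlb_gen_seen;		/* last generation this VCPU applied */
	unsigned nflushes;		/* stands in for the real flush */
};

/* Requester side: lock-less, never sleeps. Real code then sends an IPI. */
static void
vm_request_tlb_flush(struct sketch_vm *vm)
{
	atomic_fetch_add(&vm->tlb_gen, 1);
}

/* VCPU loop side: compare generations and flush lazily if behind. */
static bool
vcpu_sync_tlb(struct sketch_vm *vm, struct sketch_vcpu *vcpu)
{
	uint64_t gen = atomic_load(&vm->tlb_gen);

	if (vcpu->tlb_gen_seen == gen)
		return false;
	vcpu->tlb_gen_seen = gen;
	vcpu->nflushes++;
	return true;
}
```

Because the VCPU only compares generations, several requests that arrive before it next enters its loop collapse into a single flush.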
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.83	26-Mar-2021	reinoud	Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without signals. This introduces a new kernel and userland NVMM version indicating this support. Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.
Revision tags: thorpej-cfargs-base thorpej-futex-base
# 1.82	24-Oct-2020	mgorny	Issue 64-bit versions of XSAVE for 64-bit amd64 programs When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64 use the 64-suffixed variant in order to include the complete FIP/FDP registers in the x87 area. The difference between the two variants is that the FXSAVE64 (new) variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64), while the legacy FXSAVE variant uses split fields: 32-bit offset, 16-bit segment and 16-bit reserved field (union fp_addr.fa_32). The latter implies that the actual addresses are truncated to 32 bits which is insufficient in modern programs. The change is applied only to 64-bit programs on amd64. Plain i386 and compat32 continue using plain FXSAVE. Similarly, NVMM is not changed as I am not familiar with that code. This is a potentially breaking change. However, I don't think it likely to actually break anything because the data provided by the old variant were not meaningful (because of the truncated pointer).
# 1.81	08-Sep-2020	maxv	nvmm-x86: avoid hogging behavior observed recently When the FPU code got rewritten in NetBSD, the dependency on IPL_HIGH was eliminated, and I took _vcpu_guest_fpu_enter() out of the VCPU loop since there was no need to be in the splhigh window. Later, the code was switched to use the kernel FPU API, API that works at IPL_VM, not at IPL_NONE. These two changes mean that the whole VCPU loop is now executing at IPL_VM, which is not desired, because it introduces a delay in interrupt processing on the host in certain cases. Fix this by putting _vcpu_guest_fpu_enter() back inside the VCPU loop.
# 1.80	08-Sep-2020	maxv	nvmm: cosmetic changes - Style. - Explicitly include ioccom.h.
# 1.79	06-Sep-2020	riastradh	Fix fallout from previous uvm.h cleanup. - pmap(9) needs uvm/uvm_extern.h. - x86/pmap.h is not usable on its own; it is only usable if included via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h). - Make nvmm.h and nvmm_internal.h standalone.
# 1.78	05-Sep-2020	riastradh	Round of uvm.h cleanup. The poorly named uvm.h is generally supposed to be for uvm-internal users only. - Narrow it to files that actually need it -- mostly files that need to query whether curlwp is the pagedaemon, which should maybe be exposed by an external header. - Use uvm_extern.h where feasible and uvm_*.h for things not exposed by it. We should split up uvm_extern.h but this will serve for now to reduce the uvm.h dependencies. - Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use UVMHIST(ubchist), since ubchist is declared in uvm.h but the reference evaporates if UVMHIST is not defined, so we reduce header file dependencies. - Make uvm_device.h and uvm_swap.h independently includable while here. ok chs@
# 1.77	05-Sep-2020	maxv	x86: rename PGEX_X -> PGEX_I To match the x86 specification and the other OSes.
# 1.76	05-Sep-2020	maxv	nvmm: update copyright headers
# 1.75	04-Sep-2020	maxv	nvmm-x86-svm: check the SVM revision Only revision 1 exists, but check it, for future-proofness.
# 1.74	26-Aug-2020	maxv	nvmm-x86-svm: improve the handling of MSR_EFER Intercept reads of it as well, just to mask EFER_SVME, which the guest doesn't need to see.
# 1.73	26-Aug-2020	maxv	nvmm-x86: improve the handling of RFLAGS.RF - When injecting certain exceptions, set RF. For us to have an up-to-date view of RFLAGS, we commit the state before the event. - When advancing RIP, clear RF.
# 1.72	26-Aug-2020	maxv	nvmm-x86-svm: don't forget to intercept INVD INVD executed in the guest can be dangerous for the host, due to CPU caches being flushed without write-back.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit*; nvmm_event* -> nvmm_vcpu_event*; NVMM_EXIT_* -> NVMM_VCPU_EXIT_*; NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR; NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP. Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type: enum -> u_int; event.vector: uint64_t -> uint8_t; exit.u.*msr.msr: uint64_t -> uint32_t; exit.u.io.type: enum -> bool; exit.u.io.seg: int -> int8_t; cap.arch.mxcsr_mask: uint64_t -> uint32_t; cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t. - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
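The per-leaf selection described in 1.51 can be modeled as below. The bitfield layout is taken from the log; the structure name, the enum and the function are hypothetical. How the driver treats a request with both bits set is not stated in the log, so the sketch simply assumes such a request is rejected.

```c
#include <stdint.h>

/* Stand-in for the per-leaf flags; the field layout follows the log entry. */
struct leaf_conf {
	uint32_t mask:1;    /* filter the register values of the leaf */
	uint32_t exit:1;    /* trigger a VMEXIT to userland on the leaf */
	uint32_t rsvd:30;   /* reserved */
};

/* Hypothetical result type, for illustration only. */
enum leaf_action { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT, LEAF_INVALID };

static enum leaf_action
leaf_select_action(struct leaf_conf c)
{
	if (c.mask && c.exit)
		return LEAF_INVALID;   /* assumption: both at once is rejected */
	if (c.exit)
		return LEAF_EXIT;
	if (c.mask)
		return LEAF_MASK;
	return LEAF_DEFAULT;       /* zero on both: default behavior */
}
```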
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: [Diagram: (0) Qemu calls nvmm_vcpu_getstate in libnvmm; (1) libnvmm checks whether the state is cached in the comm page; (2) if it is not, (3) libnvmm issues an ioctl asking the kernel to cache the state; (4) the kernel writes the state to the comm page; (5) libnvmm fetches the state from the comm page; (6) libnvmm returns to Qemu.] The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
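The comm page protocol of 1.43 can be modeled in a few lines of userland C. This is only a toy model under stated assumptions: every name below is hypothetical, the kernel is replaced by plain functions, register storage is reduced to one array, and the run path is assumed to invalidate the cache (the real driver speculatively re-caches some states instead).

```c
#include <stdint.h>

#define ST_GPRS 0x1u   /* hypothetical state-group flags */
#define ST_SEGS 0x2u

struct comm_page {
	uint64_t wanted;    /* states the kernel must fetch on the next ioctl */
	uint64_t cached;    /* states currently valid in the page */
	uint64_t commit;    /* states the kernel must load at the next run */
	uint64_t gprs[16];  /* register storage, unused in this model */
};

static unsigned syscalls;   /* counts simulated kernel entries */

/* Stand-in for the kernel side of a "get state" ioctl (steps 3 and 4). */
static void
fake_kernel_getstate(struct comm_page *cp)
{
	syscalls++;
	cp->cached |= cp->wanted;   /* a real kernel would copy registers here */
	cp->wanted = 0;
}

/* Make `flags` valid in the page, entering the kernel only on a miss. */
static void
vcpu_getstate(struct comm_page *cp, uint64_t flags)
{
	if ((cp->cached & flags) == flags)
		return;                 /* direct 1 -> 5, no syscall */
	cp->wanted = flags & ~cp->cached;
	fake_kernel_getstate(cp);
}

/* Record new state in the page; nothing enters the kernel until the run. */
static void
vcpu_setstate(struct comm_page *cp, uint64_t flags)
{
	cp->cached |= flags;
	cp->commit |= flags;
}

/* Stand-in for the run path: commit pending state, then run the guest. */
static void
vcpu_run(struct comm_page *cp)
{
	syscalls++;
	cp->commit = 0;   /* committed states are now loaded in the VCPU */
	cp->cached = 0;   /* assumption: the guest ran, so the page is stale */
}
```

In this model a get, a repeated get, a set and another get cost a single kernel entry in total, which is the effect the entry describes.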
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
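The forward-compatibility argument in 1.33 can be shown with two one-line filters. The feature bits and function names are made up for the example; they are not real CPUID bits.

```c
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_A      (1u << 0)
#define FEAT_B      (1u << 1)
#define FEAT_FUTURE (1u << 2)   /* a bit introduced by a later CPU */

/* Allowlist: only bits the hypervisor knows how to virtualize pass through. */
static uint32_t
cpuid_filter_allow(uint32_t host, uint32_t allowed)
{
	return host & allowed;
}

/* Denylist: anything not explicitly forbidden leaks to the guest. */
static uint32_t
cpuid_filter_deny(uint32_t host, uint32_t denied)
{
	return host & ~denied;
}
```

With a host reporting FEAT_A | FEAT_B | FEAT_FUTURE, the allowlist hides FEAT_FUTURE without any code change, while a denylist written before that bit existed would expose it.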
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
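The generation-number scheme of 1.29 can be sketched with C11 atomics. All names are hypothetical, the broadcast IPI and the Intel-specific kcpuset are left out, and the flush itself is replaced by a counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct machine {
	_Atomic uint64_t tlb_gen;   /* per-VM generation, bumped on each request */
};

struct vcpu {
	uint64_t tlb_gen_seen;      /* per-VCPU copy of the last applied generation */
	unsigned flushes;           /* stand-in for real TLB flushes */
};

/* Requester side: bump the generation; the real code then sends the IPI. */
static void
machine_request_tlb_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
}

/* VCPU loop, before entering the guest: catch up lazily, without any lock. */
static bool
vcpu_sync_tlb(struct vcpu *v, struct machine *m)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen_seen == gen)
		return false;           /* nothing requested since the last sync */
	v->tlb_gen_seen = gen;
	v->flushes++;               /* one flush covers all pending requests */
	return true;
}
```

Because the requester only increments a counter, it never sleeps, and several requests issued while a VCPU is away collapse into a single flush when it next synchronizes.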
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
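The large-page remark in the Libnvmm part of 1.10 refers to a general rule of x86 address translation, sketched below. The log does not show the fix itself, so the function and macro names are hypothetical and the code only illustrates the rule: the offset kept from the virtual address must be as wide as the page.

```c
#include <stdint.h>

#define PAGE_4K (1ULL << 12)
#define PAGE_2M (1ULL << 21)

/* Combine a page-frame base with the offset of `gva` inside its page. */
static uint64_t
translate(uint64_t frame_base, uint64_t gva, uint64_t pagesize)
{
	uint64_t offmask = pagesize - 1;   /* pagesize must be a power of two */

	return (frame_base & ~offmask) | (gva & offmask);
}
```

For a 2 MB page the offset is 21 bits wide, so applying the 4 KB mask to the same address would drop bits 12 to 20 and point into the wrong part of the large page.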
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h, /usr/include/dev/nvmm/nvmm_ioctl.h, /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h, and the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

   +---------------------------------------------+
   |                    Qemu                     |
   +---------------------------------------------+
        |                                   ^
        | (0) nvmm_vcpu_getstate            | (6) Done
        |                                   |
        V                                   |
      +---------------------------------------+
      |                libnvmm                |
      +---------------------------------------+
            |   ^          |               ^
  (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
    cached? |   |          | "please cache | fetched
            |   |          |  the state"   |
            V   |          |               |
        +-----------+      |               |
        | Comm Page |------+---------------+
        +-----------+      |
              ^            | (4) "Alright
              |            V      babe"
              |     +--------+
              +-----| Kernel |
                    +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland.

Overall, in a normal run of Windows 10, this saves several millions of syscalls. E.g., on a 4-CPU Intel machine with 4 VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. These memcpys will be avoided in a future version. The comm page can also be extended to implement future services.
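The flag protocol described in revision 1.43 can be illustrated with a small userland model. This is only a sketch under stated assumptions: the struct layout, the `ST_*` bits, and every `sketch_*`/`fake_*` name are hypothetical and are not the NVMM definitions; the "ioctl" is simulated by a function that bumps a counter.

```c
#include <stdint.h>

/* Hypothetical state-group bits; the real NVMM constants differ. */
#define ST_GPRS (1u << 0)
#define ST_SEGS (1u << 1)
#define ST_CRS  (1u << 2)
#define ST_MSRS (1u << 3)

/* Illustrative comm page, shared between "kernel" and "userland". */
struct comm_page {
	uint32_t wanted;  /* states the kernel must fetch on ioctl */
	uint32_t cached;  /* states currently valid in the page */
	uint32_t commit;  /* states the kernel must load in vcpu_run */
};

static unsigned syscalls; /* counts simulated kernel round trips */

/* Stand-in for the ioctl: the kernel fills in the wanted states. */
static void fake_ioctl_getstate(struct comm_page *cp)
{
	syscalls++;
	cp->cached |= cp->wanted;
	cp->wanted = 0;
}

/* getstate: enter the kernel only for states not already cached. */
static void sketch_getstate(struct comm_page *cp, uint32_t flags)
{
	uint32_t missing = flags & ~cp->cached;

	if (missing != 0) {
		cp->wanted = missing;
		fake_ioctl_getstate(cp);
	}
}

/* setstate: never enters the kernel, only marks states for commit. */
static void sketch_setstate(struct comm_page *cp, uint32_t flags)
{
	cp->cached |= flags;
	cp->commit |= flags;
}

/*
 * run: one kernel entry. Pending states are committed, the guest runs
 * and may change anything, so afterwards only the states the kernel
 * chose to cache speculatively are still valid.
 */
static void sketch_run(struct comm_page *cp, uint32_t speculative)
{
	syscalls++;
	cp->commit = 0;
	cp->cached = speculative;
}
```

In this model, a `sketch_getstate()` that follows a `sketch_run()` which speculatively cached the requested groups costs no extra kernel entry, which is the saving the commit message describes for the I/O Assist path.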
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
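The difference between the two filtering styles in revision 1.33 can be shown on CPUID leaf 1, ECX. The bit positions below are the architectural ones for that register, but the particular allow and deny sets are a hypothetical subset chosen for illustration, not the masks used by the driver.

```c
#include <stdint.h>

/* CPUID.01H:ECX feature bits (architectural positions). */
#define F_SSE3   (1u << 0)
#define F_SSSE3  (1u << 9)
#define F_SSE41  (1u << 19)
#define F_SSE42  (1u << 20)
#define F_AVX    (1u << 28)

/* Hypothetical masks, for illustration only. */
#define DENY_MASK   (F_AVX)
#define ALLOW_MASK  (F_SSE3 | F_SSSE3 | F_SSE41 | F_SSE42)

/* Deny-list: anything not explicitly forbidden leaks to the guest. */
static uint32_t sketch_filter_deny(uint32_t host_ecx)
{
	return host_ecx & ~DENY_MASK;
}

/* Allow-list: only bits the hypervisor knows how to back are exposed. */
static uint32_t sketch_filter_allow(uint32_t host_ecx)
{
	return host_ecx & ALLOW_MASK;
}
```

Both variants hide AVX, but only the allow-list also hides a feature bit introduced by a CPU newer than the hypervisor, which is the forward-compatibility argument made in the commit message.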
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes lazily, as needed. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
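The generation-number scheme from revision 1.29 can be modeled in a few lines of C11. This is a sketch, assuming hypothetical `sketch_vm`/`sketch_vcpu` structures; the broadcast IPI and the actual hardware flush are replaced by comments and a counter, and the Intel-specific kcpuset tracking is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU bookkeeping. */
struct sketch_vm {
	_Atomic uint64_t tlb_gen;  /* bumped on every flush request */
};

struct sketch_vcpu {
	uint64_t seen_gen;         /* last generation this VCPU applied */
	unsigned flushes;          /* stand-in for real TLB flushes */
};

/*
 * Requester side: lock-less and non-blocking. A real driver would
 * follow the increment with a broadcast IPI to force a #VMEXIT.
 */
static void sketch_request_flush(struct sketch_vm *vm)
{
	atomic_fetch_add(&vm->tlb_gen, 1);
}

/*
 * VCPU loop side: called before each guest entry. The flush is applied
 * lazily, and several pending requests collapse into a single flush.
 */
static void sketch_sync_gen(struct sketch_vm *vm, struct sketch_vcpu *vcpu)
{
	uint64_t gen = atomic_load(&vm->tlb_gen);

	if (vcpu->seen_gen != gen) {
		vcpu->flushes++;
		vcpu->seen_gen = gen;
	}
}
```

Because the requester only performs an atomic increment, it never has to take a VCPU mutex, which is the property the rewrite was after.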
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place: it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
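The large-page fix mentioned for libnvmm in revision 1.10 ("we need to add the lower bits too") follows the general x86 rule for 2 MiB pages: the page-directory entry supplies the frame, and the low 21 bits of the virtual address supply the offset. The helper below is a sketch of that rule only; the function and macro names are hypothetical and this is not the libnvmm code.

```c
#include <stdint.h>

#define PAGE_2M_SIZE  (UINT64_C(1) << 21)
#define PAGE_2M_MASK  (PAGE_2M_SIZE - 1)             /* offset, bits 20:0 */
#define PDE_FRAME_2M  UINT64_C(0x000FFFFFFFE00000)   /* frame, bits 51:21 */

/*
 * Translate a guest virtual address mapped by a 2 MiB page.
 * Returning only (pde & PDE_FRAME_2M) would give the start of the
 * page; the offset within the page has to be added back.
 */
static uint64_t sketch_gva_to_gpa_2m(uint64_t pde, uint64_t gva)
{
	return (pde & PDE_FRAME_2M) | (gva & PAGE_2M_MASK);
}
```

The same pattern applies to 1 GiB pages with a 30-bit offset mask.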
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h, /usr/include/dev/nvmm/nvmm_ioctl.h, /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h, and the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
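The large-page fix mentioned in 1.10 ("we need to add the lower bits too") can be illustrated with a small sketch. This is not the libnvmm code; `gva_to_gpa_2m` and the two macros are hypothetical names, and the sketch assumes a 2MB page mapped by a long-mode page-directory entry.

```c
#include <stdint.h>

/* Assumed layout: physical frame of a 2MB page lives in PDE bits 51:21. */
#define PDE_FRAME_2M	0x000FFFFFFFE00000ULL
/* Offset inside a 2MB page: the low 21 bits of the virtual address. */
#define PAGE_2M_OFFSET	0x00000000001FFFFFULL

/*
 * Translate a guest virtual address that is mapped by a 2MB page.
 * Taking only the frame from the PDE would return the start of the
 * large page; the low bits of the GVA must be added to reach the
 * byte that was actually addressed.
 */
static uint64_t
gva_to_gpa_2m(uint64_t pde, uint64_t gva)
{
	return (pde & PDE_FRAME_2M) | (gva & PAGE_2M_OFFSET);
}
```

With a 4KB page only 12 offset bits are carried over, so a walker written for small pages silently drops bits 12 to 20 unless the large-page case is handled separately.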
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
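The software decoding described in 1.8 can be sketched as a scan over the legacy prefix bytes. This is an illustration under stated assumptions, not the driver's code: `outs_segment` and `enum seg` are hypothetical names, the sketch assumes the last segment override seen is the one that applies, and it ignores the 64-bit mode rule that makes ES/CS/SS/DS overrides no-ops.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment identifiers, for illustration only. */
enum seg { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS };

/*
 * Scan the legacy prefixes at the start of an instruction and return
 * the segment that applies to the OUTS source operand. DS is the
 * default when no override prefix is present.
 */
static enum seg
outs_segment(const uint8_t *insn, size_t len)
{
	enum seg seg = SEG_DS;
	size_t i;

	for (i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: seg = SEG_ES; break;
		case 0x2E: seg = SEG_CS; break;
		case 0x36: seg = SEG_SS; break;
		case 0x3E: seg = SEG_DS; break;
		case 0x64: seg = SEG_FS; break;
		case 0x65: seg = SEG_GS; break;
		case 0x66:	/* operand-size override */
		case 0x67:	/* address-size override */
		case 0xF0:	/* LOCK */
		case 0xF2:	/* REPNE */
		case 0xF3:	/* REP */
			break;
		default:
			/* First non-prefix byte: the opcode starts here. */
			return seg;
		}
	}
	return seg;
}
```

For example, the byte sequence `65 F3 6F` (GS override, REP, OUTS) yields `SEG_GS`, whereas a plain `F3 6E` keeps the default `SEG_DS`.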
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.79	06-Sep-2020	riastradh	Fix fallout from previous uvm.h cleanup. - pmap(9) needs uvm/uvm_extern.h. - x86/pmap.h is not usable on its own; it is only usable if included via uvm/uvm_extern.h (-> uvm/uvm_pmap.h -> machine/pmap.h). - Make nvmm.h and nvmm_internal.h standalone.
# 1.78	05-Sep-2020	riastradh	Round of uvm.h cleanup. The poorly named uvm.h is generally supposed to be for uvm-internal users only. - Narrow it to files that actually need it -- mostly files that need to query whether curlwp is the pagedaemon, which should maybe be exposed by an external header. - Use uvm_extern.h where feasible and uvm_*.h for things not exposed by it. We should split up uvm_extern.h but this will serve for now to reduce the uvm.h dependencies. - Use uvm_stat.h and #ifdef UVMHIST uvm.h for files that use UVMHIST(ubchist), since ubchist is declared in uvm.h but the reference evaporates if UVMHIST is not defined, so we reduce header file dependencies. - Make uvm_device.h and uvm_swap.h independently includable while here. ok chs@
# 1.77	05-Sep-2020	maxv	x86: rename PGEX_X -> PGEX_I To match the x86 specification and the other OSes.
# 1.76	05-Sep-2020	maxv	nvmm: update copyright headers
# 1.75	04-Sep-2020	maxv	nvmm-x86-svm: check the SVM revision Only revision 1 exists, but check it, for future-proofness.
# 1.74	26-Aug-2020	maxv	nvmm-x86-svm: improve the handling of MSR_EFER Intercept reads of it as well, just to mask EFER_SVME, which the guest doesn't need to see.
# 1.73	26-Aug-2020	maxv	nvmm-x86: improve the handling of RFLAGS.RF - When injecting certain exceptions, set RF. For us to have an up-to-date view of RFLAGS, we commit the state before the event. - When advancing RIP, clear RF.
# 1.72	26-Aug-2020	maxv	nvmm-x86-svm: don't forget to intercept INVD INVD executed in the guest can be dangerous for the host, due to CPU caches being flushed without write-back.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
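The two rules from 1.56 can be shown in a few lines. The names below (`enum kind`, `kind_from_cookie`, `has_cookie`) are hypothetical and only illustrate the pattern; they do not come from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum carried inside an opaque pointer-sized cookie. */
enum kind { KIND_A = 1, KIND_B = 2 };

/*
 * A pointer and an enum are not necessarily the same size, so go
 * through uintptr_t first instead of casting the pointer directly.
 */
static enum kind
kind_from_cookie(const void *cookie)
{
	return (enum kind)(uintptr_t)cookie;
}

/* Compare against NULL explicitly rather than casting to bool. */
static bool
has_cookie(const void *cookie)
{
	return cookie != NULL;
}
```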
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
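A rough sketch of how the two selector bits from 1.51 could be interpreted follows. Only the three bitfields come from the log entry; the struct name, the `leaf` member, the `cpuid_action` enum and the choice to reject both bits being set at once are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical per-leaf configuration; the bitfields mirror the log entry. */
struct cpuid_conf_sketch {
	uint32_t leaf;
	uint32_t mask:1;	/* filter the leaf's registers */
	uint32_t exit:1;	/* trigger a VMEXIT on the leaf */
	uint32_t rsvd:30;
};

enum cpuid_action {
	CPUID_ACT_DEFAULT,	/* both bits zero: default behavior */
	CPUID_ACT_MASK,
	CPUID_ACT_EXIT,
	CPUID_ACT_INVALID	/* assumption: both bits set is rejected */
};

static enum cpuid_action
cpuid_conf_action(const struct cpuid_conf_sketch *conf)
{
	if (conf->mask && conf->exit)
		return CPUID_ACT_INVALID;
	if (conf->exit)
		return CPUID_ACT_EXIT;
	if (conf->mask)
		return CPUID_ACT_MASK;
	return CPUID_ACT_DEFAULT;
}
```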
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

    +---------------------------------------------+
    |                    Qemu                     |
    +---------------------------------------------+
         |                                   ^
         | (0) nvmm_vcpu_getstate            | (6) Done
         |                                   |
         V                                   |
       +---------------------------------------+
       |                libnvmm                |
       +---------------------------------------+
            |   ^          |               ^
  (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
    cached? |   |          | "please cache |     fetched
            |   |          |  the state"   |
            V   |          |               |
        +-----------+      |               |
        | Comm Page |------+---------------+
        +-----------+      |
                 ^         |
    (4) "Alright |         V
         babe"   |     +--------+
                 +-----| Kernel |
                       +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls.
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
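The interplay of the three comm page flags described in 1.43 can be sketched in userland C. This is a simplified model, not the libnvmm source: the struct, the `STATE_*` bits and both function names are hypothetical, and the ioctl itself is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state-group bits. */
#define STATE_GPRS	0x1u
#define STATE_SEGS	0x2u
#define STATE_CRS	0x4u
#define STATE_MSRS	0x8u

/* Simplified model of the flags held in the comm page. */
struct comm_sketch {
	uint64_t state_wanted;	/* what the kernel must fetch on ioctl */
	uint64_t state_cached;	/* what is already valid in the page */
	uint64_t state_commit;	/* what the kernel must set in vcpu_run */
};

/*
 * getstate path: returns false when everything requested is already
 * cached (no syscall needed). Otherwise records the missing groups in
 * state_wanted and returns true, meaning an ioctl would follow.
 */
static bool
getstate_needs_syscall(struct comm_sketch *comm, uint64_t flags)
{
	if ((flags & ~comm->state_cached) == 0)
		return false;
	comm->state_wanted = flags & ~comm->state_cached;
	return true;
}

/*
 * setstate path: never a syscall. The new values are considered cached
 * and are flagged for commit at the next vcpu_run.
 */
static void
setstate_defer(struct comm_sketch *comm, uint64_t flags)
{
	comm->state_cached |= flags;
	comm->state_commit |= flags;
}
```

In this model the speculative caching mentioned in the entry amounts to the kernel setting bits in `state_cached` before returning to userland, so that a later get of GPRS+SEGS+CRS+MSRS takes the no-syscall branch.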
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
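The allowlist approach from 1.33 can be sketched as a single AND against the bits the hypervisor knows how to virtualize. The bit positions below follow CPUID leaf 1 ECX, but the selection of allowed bits and all names are assumptions for illustration, not the driver's actual mask.

```c
#include <stdint.h>

/* A few CPUID.01H:ECX feature bits. */
#define CPUID1_ECX_SSE3		(1u << 0)
#define CPUID1_ECX_SSSE3	(1u << 9)
#define CPUID1_ECX_SSE41	(1u << 19)
#define CPUID1_ECX_SSE42	(1u << 20)
#define CPUID1_ECX_AVX		(1u << 28)

/* Hypothetical allowlist: only bits explicitly known to be safe. */
#define CPUID1_ECX_ALLOWED \
	(CPUID1_ECX_SSE3 | CPUID1_ECX_SSSE3 | \
	 CPUID1_ECX_SSE41 | CPUID1_ECX_SSE42)

/*
 * Filter the host value before exposing it to the guest. Anything not
 * on the list, including AVX and bits defined by future CPUs, is
 * hidden without further code changes.
 */
static uint32_t
cpuid1_ecx_filter(uint32_t host_ecx)
{
	return host_ecx & CPUID1_ECX_ALLOWED;
}
```

A denylist would instead clear known-bad bits and pass everything else through, which is why a feature introduced by a newer CPU could leak into the guest.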
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
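The generation-number scheme from 1.29 can be modeled with C11 atomics. This is a userland sketch under stated assumptions, not the kernel code: the struct and function names are hypothetical, the broadcast IPI is omitted, and the Intel-specific kcpuset tracking is not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state: bumped on every flush request. */
struct vm_sketch {
	_Atomic uint64_t tlb_gen;
};

/* Hypothetical per-VCPU state: last generation this VCPU applied. */
struct vcpu_sketch {
	uint64_t tlb_gen;
};

/*
 * Requester side: bump the per-VM generation. No lock is taken and
 * nothing blocks. In the real scheme an IPI follows, forcing each
 * VCPU out of the guest so that it notices the new generation.
 */
static void
vm_request_tlb_flush(struct vm_sketch *vm)
{
	atomic_fetch_add(&vm->tlb_gen, 1);
}

/*
 * VCPU loop side: compare the per-VCPU copy with the per-VM value.
 * Returns true if a flush must be applied before the next VMENTRY.
 * Several requests collapse into one flush.
 */
static bool
vcpu_sync_tlb_gen(struct vcpu_sketch *vcpu, struct vm_sketch *vm)
{
	uint64_t gen = atomic_load(&vm->tlb_gen);

	if (vcpu->tlb_gen == gen)
		return false;
	vcpu->tlb_gen = gen;
	return true;
}
```

Because each VCPU only ever writes its own copy, the requester never needs a VCPU mutex, which is the property the entry is after.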
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.77	05-Sep-2020	maxv	x86: rename PGEX_X -> PGEX_I To match the x86 specification and the other OSes.
# 1.76	05-Sep-2020	maxv	nvmm: update copyright headers
# 1.75	04-Sep-2020	maxv	nvmm-x86-svm: check the SVM revision Only revision 1 exists, but check it, for future-proofness.
# 1.74	26-Aug-2020	maxv	nvmm-x86-svm: improve the handling of MSR_EFER Intercept reads of it as well, just to mask EFER_SVME, which the guest doesn't need to see.
# 1.73	26-Aug-2020	maxv	nvmm-x86: improve the handling of RFLAGS.RF - When injecting certain exceptions, set RF. For us to have an up-to-date view of RFLAGS, we commit the state before the event. - When advancing RIP, clear RF.
# 1.72	26-Aug-2020	maxv	nvmm-x86-svm: don't forget to intercept INVD INVD executed in the guest can be dangerous for the host, due to CPU caches being flushed without write-back.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
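The check in revision 1.68 can be sketched as a comparison of the old and new CR4 values against a mask of TLB-relevant bits. The function and mask names are hypothetical. Only PCIDE (bit 17) and SMEP (bit 20) are listed because those are the bits the commit names; the set of bits the driver already handled before this change is not stated in the log and is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_PCIDE	(1ULL << 17)	/* Process Context ID Enable */
#define CR4_SMEP	(1ULL << 20)	/* Supervisor Mode Execution Prevention */

/* Assumption: only the two bits named in the commit; the real mask is wider. */
#define CR4_TLB_FLUSH_BITS	(CR4_PCIDE | CR4_SMEP)

/* Return true when a CR4 write toggles a bit that requires a guest TLB flush. */
static bool
cr4_change_needs_flush(uint64_t old_cr4, uint64_t new_cr4)
{
	return ((old_cr4 ^ new_cr4) & CR4_TLB_FLUSH_BITS) != 0;
}
```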
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
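The two rules from revision 1.56 can be illustrated with a small sketch. The enum and function names below are hypothetical and do not come from the driver; the point is only the intermediate `uintptr_t` cast and the explicit NULL comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cookie_kind { KIND_NONE = 0, KIND_A = 1, KIND_B = 2 };	/* hypothetical */

/*
 * A pointer and an enum are not necessarily the same size, so go through
 * uintptr_t first instead of casting the pointer to the enum directly.
 */
static enum cookie_kind
kind_from_cookie(const void *cookie)
{
	return (enum cookie_kind)(uintptr_t)cookie;
}

/* Compare against NULL rather than casting the pointer to bool. */
static bool
has_cookie(const void *cookie)
{
	return cookie != NULL;
}
```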
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
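The per-leaf flags from revision 1.51 can be sketched as follows. The three bitfields are the ones quoted in the commit message; the struct name, the enum, and the decoding function are hypothetical. Treating "both bits set" and "reserved bits set" as invalid is an assumption of this sketch, since the log only says what zero on both means.

```c
#include <stdint.h>

/* Bitfields as quoted in the commit message; the struct name is made up. */
struct cpuid_leaf_conf {
	uint32_t mask:1;	/* filter the values returned for the leaf */
	uint32_t exit:1;	/* trigger a VMEXIT when the leaf is queried */
	uint32_t rsvd:30;	/* reserved */
};

enum leaf_behavior { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT, LEAF_INVALID };

/* Decode the requested behavior for one leaf. */
static enum leaf_behavior
leaf_behavior_of(const struct cpuid_leaf_conf *c)
{
	if (c->rsvd != 0)
		return LEAF_INVALID;	/* assumption: reserved must be zero */
	if (c->mask && c->exit)
		return LEAF_INVALID;	/* assumption: mutually exclusive */
	if (c->mask)
		return LEAF_MASK;
	if (c->exit)
		return LEAF_EXIT;
	return LEAF_DEFAULT;		/* zero on both: default behavior */
}
```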
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: [Diagram: (0) Qemu calls nvmm_vcpu_getstate in libnvmm; (1) libnvmm asks the comm page "State cached?"; (2) the comm page answers "No"; (3) libnvmm issues an ioctl to the kernel: "please cache the state"; (4) the kernel fills the comm page: "Alright babe"; (5) libnvmm reads the comm page: "Ok, state fetched"; (6) libnvmm returns "Done" to Qemu.] The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls performed by libnvmm or the emulator use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
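The comm page protocol from revision 1.43 can be modelled in userland C. This is a sketch under stated assumptions, not the driver's code: all struct, function, and flag names are hypothetical, only one state class (the GPRs) is modelled, the "kernel" is an ordinary function that bumps a counter standing in for a syscall, and the cache is simply dropped after a run instead of being refilled speculatively.

```c
#include <stdint.h>
#include <string.h>

#define STATE_GPRS	0x01ULL		/* hypothetical state-class flag */

struct comm_page {
	uint64_t wanted;	/* states the kernel must fetch on request */
	uint64_t cached;	/* states currently valid in the page */
	uint64_t commit;	/* states the kernel must load in vcpu_run */
	uint64_t gprs[16];
};

struct vcpu {
	struct comm_page comm;
	uint64_t hw_gprs[16];	/* stands in for the hardware-side state */
	unsigned nsyscalls;	/* counts simulated kernel entries */
};

/* Kernel side of the "please cache the state" ioctl (steps 3 and 4). */
static void
kern_cache_state(struct vcpu *v)
{
	v->nsyscalls++;
	if (v->comm.wanted & STATE_GPRS)
		memcpy(v->comm.gprs, v->hw_gprs, sizeof(v->comm.gprs));
	v->comm.cached |= v->comm.wanted;
	v->comm.wanted = 0;
}

/* Library side: enter the kernel only if the state is not cached. */
static const uint64_t *
lib_getstate_gprs(struct vcpu *v)
{
	if (!(v->comm.cached & STATE_GPRS)) {
		v->comm.wanted = STATE_GPRS;
		kern_cache_state(v);
	}
	return v->comm.gprs;
}

/* Library side: write into the page and mark it for commit; no syscall. */
static void
lib_setstate_gprs(struct vcpu *v, const uint64_t *gprs)
{
	memcpy(v->comm.gprs, gprs, sizeof(v->comm.gprs));
	v->comm.cached |= STATE_GPRS;
	v->comm.commit |= STATE_GPRS;
}

/* Kernel side of vcpu_run: commit pending state, then "run" the guest. */
static void
kern_vcpu_run(struct vcpu *v)
{
	v->nsyscalls++;
	if (v->comm.commit & STATE_GPRS)
		memcpy(v->hw_gprs, v->comm.gprs, sizeof(v->hw_gprs));
	v->comm.commit = 0;
	/* The guest may change anything, so the cache is stale afterwards. */
	v->comm.cached = 0;
}
```

In this model a second `lib_getstate_gprs()` call and any `lib_setstate_gprs()` call leave `nsyscalls` unchanged, which is the saving the commit describes.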
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
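The allowlist approach from revision 1.33 amounts to ANDing the host value with the bits the guest may see, so that a bit introduced by a future CPU is hidden by default. The sketch below uses CPUID leaf 1 ECX; the bit positions (SSE3 bit 0, SSE4.1 bit 19, SSE4.2 bit 20, AVX bit 28) are architectural, while the macro names, the function, and the particular set of allowed bits are illustrative assumptions rather than the driver's list.

```c
#include <stdint.h>

#define CPUID2_SSE3	(1U << 0)
#define CPUID2_SSE41	(1U << 19)
#define CPUID2_SSE42	(1U << 20)
#define CPUID2_AVX	(1U << 28)

/* Illustrative allowlist: AVX is absent on purpose, as in the commit. */
#define GUEST_ALLOWED_ECX	(CPUID2_SSE3 | CPUID2_SSE41 | CPUID2_SSE42)

/* Keep only explicitly allowed bits; unknown or future bits are dropped. */
static uint32_t
filter_cpuid1_ecx(uint32_t host_ecx)
{
	return host_ecx & GUEST_ALLOWED_ECX;
}
```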
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
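The generation-number scheme from revision 1.29 can be sketched with C11 atomics. All names are hypothetical, the broadcast IPI is reduced to a comment, and the Intel-specific kcpuset tracking is not modelled. The requester only increments a counter, and each VCPU compares its private copy against it when it next passes through its loop, so no mutex is taken on either side.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct machine {
	atomic_uint_fast64_t tlb_gen;	/* bumped on every flush request */
};

struct vcpu_tlb {
	uint64_t seen_gen;	/* per-VCPU copy of the generation */
	unsigned nflushes;	/* counts flushes actually applied */
};

/* Request a flush for the whole VM: lock-less and non-blocking. */
static void
machine_request_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
	/* Real code would now send a broadcast IPI to force #VMEXITs. */
}

/*
 * Called from the VCPU loop: apply a flush lazily if the per-VM
 * generation moved since this VCPU last looked. Several requests that
 * arrive between two checks collapse into a single flush.
 */
static bool
vcpu_sync_tlb(struct machine *m, struct vcpu_tlb *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->seen_gen == gen)
		return false;
	v->seen_gen = gen;
	v->nflushes++;
	return true;
}
```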
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
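The OSXSAVE rule from revision 1.25 can be sketched as a fixup of CPUID leaf 1 ECX. CPUID2_OSXSAVE and CR4_OSXSAVE are the names used in the commit message; their bit positions (ECX bit 27 and CR4 bit 18) are architectural. The function name is hypothetical, and hiding XSAVES is not covered here.

```c
#include <stdint.h>

#define CPUID2_OSXSAVE	(1U << 27)	/* CPUID.1:ECX.OSXSAVE */
#define CR4_OSXSAVE	(1ULL << 18)	/* CR4.OSXSAVE */

/*
 * Report OSXSAVE to the guest only if the guest itself enabled it in
 * its own CR4; the host's setting must not leak through.
 */
static uint32_t
cpuid1_ecx_fixup(uint32_t ecx, uint64_t guest_cr4)
{
	ecx &= ~CPUID2_OSXSAVE;
	if (guest_cr4 & CR4_OSXSAVE)
		ecx |= CPUID2_OSXSAVE;
	return ecx;
}
```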
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
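The libnvmm large-page fix listed in revision 1.10 ("we need to add the lower bits too") can be illustrated for a 2 MiB page in long mode. The frame mask covers bits 51:21 of the page-directory entry, which is architectural; the macro and function names are hypothetical, and the sketch handles this one page size only.

```c
#include <stdint.h>

#define L2_PAGE_SIZE	(1ULL << 21)			/* 2 MiB */
#define PDE_2M_FRAME	0x000FFFFFFFE00000ULL		/* bits 51:21 */

/*
 * Translate through a large-page PDE: take the frame from the entry and
 * add the low 21 bits of the virtual address. Returning the frame alone
 * would point at the start of the large page for every address in it.
 */
static uint64_t
large_page_gpa(uint64_t pde, uint64_t gva)
{
	return (pde & PDE_2M_FRAME) + (gva & (L2_PAGE_SIZE - 1));
}
```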
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.74	26-Aug-2020	maxv	nvmm-x86-svm: improve the handling of MSR_EFER Intercept reads of it as well, just to mask EFER_SVME, which the guest doesn't need to see.
# 1.73	26-Aug-2020	maxv	nvmm-x86: improve the handling of RFLAGS.RF - When injecting certain exceptions, set RF. For us to have an up-to-date view of RFLAGS, we commit the state before the event. - When advancing RIP, clear RF.
# 1.72	26-Aug-2020	maxv	nvmm-x86-svm: don't forget to intercept INVD INVD executed in the guest can be dangerous for the host, due to CPU caches being flushed without write-back.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

  +---------------------------------------------+
  |                    Qemu                     |
  +---------------------------------------------+
       |                                   ^
       | (0) nvmm_vcpu_getstate            | (6) Done
       |                                   |
       V                                   |
     +---------------------------------------+
     |                libnvmm                |
     +---------------------------------------+
            |   ^          |               ^
  (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
    cached? |   |          | "please cache |     fetched
            |   |          |  the state"   |
            V   |          |               |
        +-----------+      |               |
        | Comm Page |------+---------------+
        +-----------+      |
              ^            | (4) "Alright
              |            V      babe"
              |     +--------+
              +-----| Kernel |
                    +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls.
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g., on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
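The cached-state fast path described in revision 1.43 can be sketched in user space as follows. This is only an illustration of the "wanted"/"cached"/"commit" idea: `struct comm_page`, `fake_ioctl_cache()`, `getstate()`, `setstate()` and the `ST_*` flag values are hypothetical names, not taken from the NVMM sources, and the ioctl is replaced by a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state-group flags (not the real NVMM constants). */
#define ST_GPRS 0x1u
#define ST_SEGS 0x2u
#define ST_CRS  0x4u

struct comm_page {
	uint64_t wanted;	/* states the kernel must fetch on request */
	uint64_t cached;	/* states currently valid in the page */
	uint64_t commit;	/* states the kernel must apply in vcpu_run */
};

static unsigned nsyscalls;	/* counts simulated kernel round trips */

/* Stand-in for the ioctl: the "kernel" caches whatever is wanted. */
static void
fake_ioctl_cache(struct comm_page *cp)
{
	nsyscalls++;
	cp->cached |= cp->wanted;
	cp->wanted = 0;
}

/* getstate: only enter the kernel for states that are not cached. */
static void
getstate(struct comm_page *cp, uint64_t flags)
{
	assert(flags != 0);
	if ((cp->cached & flags) == flags)
		return;			/* direct 1->5, no syscall */
	cp->wanted = flags & ~cp->cached;
	fake_ioctl_cache(cp);
}

/* setstate: never enters the kernel, just marks states for commit. */
static void
setstate(struct comm_page *cp, uint64_t flags)
{
	cp->cached |= flags;
	cp->commit |= flags;
}
```

In this sketch a second `getstate()` for an already cached group returns without touching `nsyscalls`, which is the saving the commit message describes.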
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
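The allowlist approach from revision 1.33 can be illustrated with a small sketch. The bit positions are the architectural ones for CPUID.01H:ECX; the `CPUID2_*` macro names, the contents of `ECX_ALLOWED` and the function `filter_ecx()` are assumptions made for this example, and the real driver's list is longer.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.01H:ECX feature bits (architectural positions). */
#define CPUID2_SSE3	(1u << 0)
#define CPUID2_SSSE3	(1u << 9)
#define CPUID2_SSE41	(1u << 19)
#define CPUID2_SSE42	(1u << 20)
#define CPUID2_AVX	(1u << 28)

/* Illustrative allowlist: only bits named here reach the guest. */
#define ECX_ALLOWED \
	(CPUID2_SSE3 | CPUID2_SSSE3 | CPUID2_SSE41 | CPUID2_SSE42)

/* AVX must stay hidden, since the guest state is only x86+SSE. */
static_assert((ECX_ALLOWED & CPUID2_AVX) == 0, "AVX must not be allowed");

static uint32_t
filter_ecx(uint32_t host_ecx)
{
	/* Unknown bits from future CPUs are dropped automatically. */
	return host_ecx & ECX_ALLOWED;
}
```

With a denylist, a feature bit introduced by a newer CPU would leak through until someone added it to the list; with the mask above it is hidden by default.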
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes lazily as needed. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
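The generation-number scheme from revision 1.29 can be sketched as below. This is a simplified user-space model using C11 atomics: `struct machine`, `struct vcpu`, `machine_tlb_flush()` and `vcpu_tlb_sync()` are hypothetical names, the IPI is omitted, and the hardware flush is replaced by a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct machine {
	_Atomic uint64_t tlb_gen;	/* bumped on every flush request */
};

struct vcpu {
	uint64_t tlb_gen;		/* last generation this VCPU applied */
	unsigned nflushes;		/* flushes actually performed */
};

/* Request a flush: lock-less, never sleeps. Real code then sends IPIs. */
static void
machine_tlb_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
}

/* Called in the VCPU loop once the IPI has forced a #VMEXIT. */
static bool
vcpu_tlb_sync(struct machine *m, struct vcpu *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	assert(v->tlb_gen <= gen);	/* generations only move forward */
	if (v->tlb_gen == gen)
		return false;		/* already up to date */
	v->nflushes++;			/* stand-in for the hardware flush */
	v->tlb_gen = gen;
	return true;
}
```

Because the VCPU only compares generations, several flush requests that arrive before the next synchronization collapse into a single flush.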
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
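The OSXSAVE rule from revision 1.25 amounts to mirroring a guest CR4 bit into the CPUID result. A minimal sketch, assuming the architectural bit positions (CR4 bit 18, CPUID.01H:ECX bit 27); the function name `cpuid1_ecx_for_guest()` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR4_OSXSAVE	(1ull << 18)	/* CR4.OSXSAVE */
#define CPUID2_OSXSAVE	(1u << 27)	/* CPUID.01H:ECX.OSXSAVE */

static uint32_t
cpuid1_ecx_for_guest(uint32_t ecx, uint64_t guest_cr4)
{
	/* Never inherit the host's OSXSAVE bit. */
	ecx &= ~CPUID2_OSXSAVE;

	/* Report it only once the guest has enabled it in its own CR4. */
	if (guest_cr4 & CR4_OSXSAVE)
		ecx |= CPUID2_OSXSAVE;

	assert(!(ecx & CPUID2_OSXSAVE) || (guest_cr4 & CR4_OSXSAVE));
	return ecx;
}
```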
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
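The large-page fix mentioned under Libnvmm in revision 1.10 ("we need to add the lower bits too") can be illustrated for a 2MB page. This is a sketch under the standard x86-64 paging layout; the macro names and `translate_2mb()` are hypothetical and not taken from libnvmm.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PS		(1ull << 7)		/* entry maps a large page */
#define L2_FRAME	0x000FFFFFFFE00000ull	/* bits 51:21 of a 2MB PDE */
#define L2_OFFSET	0x00000000001FFFFFull	/* low 21 bits of the address */

static uint64_t
translate_2mb(uint64_t pde, uint64_t gva)
{
	assert(pde & PTE_PS);

	/*
	 * The frame alone only gives the start of the 2MB page; the
	 * low 21 bits of the virtual address select the byte within it.
	 */
	return (pde & L2_FRAME) | (gva & L2_OFFSET);
}
```

Dropping the `gva & L2_OFFSET` term would make every access resolve to the first byte of the large page.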
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
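The software decoding described in revision 1.8 can be sketched as a scan over the legacy prefix bytes that precede the opcode. This is a simplified illustration: it ignores REX prefixes and the 64-bit-mode rules for segment overrides, and `enum seg` and `outs_segment()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum seg { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS };

/* Scan legacy prefixes before the opcode; the last override wins. */
static enum seg
outs_segment(const uint8_t *insn, size_t len)
{
	enum seg seg = SEG_DS;	/* default source segment for OUTS */

	assert(insn != NULL);
	for (size_t i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: seg = SEG_ES; break;
		case 0x2E: seg = SEG_CS; break;
		case 0x36: seg = SEG_SS; break;
		case 0x3E: seg = SEG_DS; break;
		case 0x64: seg = SEG_FS; break;
		case 0x65: seg = SEG_GS; break;
		case 0x66: case 0x67:			/* operand/address size */
		case 0xF0: case 0xF2: case 0xF3:	/* LOCK/REPNE/REP */
			break;
		default:
			return seg;	/* reached the opcode */
		}
	}
	return seg;
}
```

For example, the byte sequence `F3 65 6E` (REP, GS override, OUTSB) yields `SEG_GS`, whereas assuming DS would read from the wrong segment base.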
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.71	22-Aug-2020	maxv	nvmm-x86-svm: dedup code
# 1.70	20-Aug-2020	maxv	nvmm-x86: improve the CPUID emulation - x86-svm: explicitly handle 0x80000007 and 0x80000008. The latter contains extended features we must filter out. Apply the same in x86-vmx for symmetry. - x86-svm: explicitly handle extended leaves until 0x8000001F, and truncate to it.
# 1.69	18-Aug-2020	maxv	nvmm-x86-svm: improve the CPUID emulation Limit the hypervisor range, and properly handle each basic leaf until 0xD.
# 1.68	18-Aug-2020	maxv	nvmm-x86: also flush the guest TLB when CR4.{PCIDE,SMEP} changes
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
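The casting rule from revision 1.56 can be shown with a small example. The enum and function names below are hypothetical; the point is only the intermediate `uintptr_t` cast and the explicit NULL comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum kind { KIND_A = 1, KIND_B = 2 };

/* Store an enum value in a pointer-sized cookie. */
static void *
cookie_from_kind(enum kind k)
{
	return (void *)(uintptr_t)k;
}

/* Go through uintptr_t: a pointer and an enum may differ in size. */
static enum kind
kind_from_cookie(void *cookie)
{
	assert(cookie != NULL);
	return (enum kind)(uintptr_t)cookie;
}

/* Compare against NULL instead of casting the pointer to bool. */
static bool
has_cookie(const void *cookie)
{
	return cookie != NULL;
}
```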
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
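The per-leaf selection bits quoted in revision 1.51 could be interpreted along these lines. The three bit-fields are as given in the commit message; the struct name, the `enum leaf_action` values and `leaf_action_for()` are hypothetical, and treating "both bits set" as invalid is an assumption of this sketch, not something the message states.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-fields as quoted in the commit message; struct name is made up. */
struct cpuid_leaf_conf {
	uint32_t mask:1;	/* filter the leaf's output */
	uint32_t exit:1;	/* trigger a VMEXIT on this leaf */
	uint32_t rsvd:30;
};

enum leaf_action { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT };

static enum leaf_action
leaf_action_for(struct cpuid_leaf_conf c)
{
	/* Assumption: setting both bits at once is rejected here. */
	assert(!(c.mask && c.exit));

	if (c.exit)
		return LEAF_EXIT;
	if (c.mask)
		return LEAF_MASK;
	return LEAF_DEFAULT;	/* both zero: back to default behavior */
}
```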
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as neededi lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
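The large-page fix listed under Libnvmm in revision 1.10 ("we need to add the lower bits too") can be illustrated with a short sketch. This is not the libnvmm code: the helper name `large_page_gpa` and the 2 MiB page size are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: a 2 MiB large page (x86-64 PDE mapping). */
#define LARGE_PAGE_SIZE	(UINT64_C(1) << 21)
#define LARGE_PAGE_MASK	(LARGE_PAGE_SIZE - 1)

/*
 * Hypothetical helper: build the guest physical address for an access
 * that hits a large page. 'frame' is the page-aligned base taken from
 * the page-table entry, 'gva' is the guest virtual address.
 */
static uint64_t
large_page_gpa(uint64_t frame, uint64_t gva)
{
	/* The base of a large page is aligned on the large page size. */
	assert((frame & LARGE_PAGE_MASK) == 0);

	/* Returning 'frame' alone would lose the offset within the page. */
	return frame | (gva & LARGE_PAGE_MASK);
}
```

Without the second term every access inside the large page would resolve to its first byte, which is the kind of error the commit message describes.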
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.67	05-Aug-2020	maxv	Add new field definitions, and intercept everything, for future-proofness.
# 1.66	05-Aug-2020	maxv	Use ULL, to make it clear we are unsigned.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
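As an illustration of the convention mentioned in revision 1.61, the sketch below fills the registers for CPUID leaf 0x40000000: EAX carries the highest hypervisor leaf, and EBX/ECX/EDX carry a 12-byte signature. The structure, the function name and the single-leaf assumption are hypothetical; they are not taken from the NVMM sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HV_LEAF_BASE	0x40000000u	/* first hypervisor CPUID leaf */
#define HV_LEAF_MAX	0x40000000u	/* assumption: only one leaf is implemented */

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Fill the values returned to the guest for leaf HV_LEAF_BASE.
 * 'sig' is a 12-byte signature, not necessarily NUL-terminated.
 */
static void
hv_leaf_fill(struct cpuid_regs *r, const char *sig)
{
	assert(r != NULL && sig != NULL);

	r->eax = HV_LEAF_MAX;		/* highest hypervisor leaf */
	memcpy(&r->ebx, sig + 0, 4);	/* signature bytes 0..3 */
	memcpy(&r->ecx, sig + 4, 4);	/* signature bytes 4..7 */
	memcpy(&r->edx, sig + 8, 4);	/* signature bytes 8..11 */
}
```

A guest that probes for a hypervisor reads EAX first to learn how many leaves above 0x40000000 it may query.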
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
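The two selector bits described in revision 1.51 could be interpreted along these lines. The bitfield layout is the one quoted in the commit message; the structure name, the enum and the decision to treat "both bits set" as invalid are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Layout quoted in the commit message; the struct name is hypothetical. */
struct cpuid_conf_flags {
	uint32_t mask:1;	/* filter the leaf with caller-supplied masks */
	uint32_t exit:1;	/* trigger a VMEXIT when the leaf is queried */
	uint32_t rsvd:30;
};

/* Hypothetical result type for the sketch. */
enum cpuid_action {
	CPUID_ACT_DEFAULT,	/* both bits zero: reset to default behavior */
	CPUID_ACT_MASK,
	CPUID_ACT_EXIT,
	CPUID_ACT_INVALID	/* assumption: both bits set is rejected */
};

static enum cpuid_action
cpuid_conf_action(struct cpuid_conf_flags f)
{
	if (f.mask && f.exit)
		return CPUID_ACT_INVALID;
	if (f.exit)
		return CPUID_ACT_EXIT;
	if (f.mask)
		return CPUID_ACT_MASK;
	return CPUID_ACT_DEFAULT;
}
```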
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

      +---------------------------------------------+
      |                    Qemu                     |
      +---------------------------------------------+
           |                                  ^
           | (0) nvmm_vcpu_getstate           | (6) Done
           |                                  |
           V                                  |
         +---------------------------------------+
         |                libnvmm                |
         +---------------------------------------+
              |   ^           |               ^
    (1) State |   | (2) No    | (3) Ioctl:    | (5) Ok, state
      cached? |   |           | "please cache |     fetched
              |   |           |  the state"   |
              V   |           |               |
           +-----------+      |               |
           | Comm Page |------+---------------+
           +-----------+      |
                 ^            | (4) "Alright
                 |            V      babe"
                 |     +--------+
                 +-----| Kernel |
                       +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
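The generation-number scheme described in revision 1.29 can be sketched in userland C11 as below. It is a model of the idea only: the structure and function names are hypothetical, C11 atomics stand in for the kernel's primitives, and the broadcast IPI and the actual TLB flush are reduced to comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU state for the sketch. */
struct sketch_machine {
	_Atomic uint64_t tlb_gen;	/* bumped on every flush request */
};

struct sketch_vcpu {
	uint64_t tlb_gen_seen;		/* last generation applied by this VCPU */
};

static void
sketch_machine_init(struct sketch_machine *m)
{
	atomic_init(&m->tlb_gen, 0);
}

/* Requester side: never sleeps, takes no lock. */
static void
sketch_tlb_flush_request(struct sketch_machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
	/* The driver would now send a broadcast IPI to force a #VMEXIT. */
}

/*
 * VCPU loop side: compare the per-VM generation with the local copy.
 * Returns true if a flush has to be applied before re-entering the guest.
 */
static bool
sketch_vcpu_sync(struct sketch_machine *m, struct sketch_vcpu *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen_seen == gen)
		return false;
	v->tlb_gen_seen = gen;	/* several requests collapse into one flush */
	return true;
}
```

Because each VCPU only compares and copies a counter, several flush requests issued between two guest entries cost that VCPU a single flush.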
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

        +---------------------------------------------+
        |                    Qemu                     |
        +---------------------------------------------+
             |                                   ^
             | (0) nvmm_vcpu_getstate            | (6) Done
             |                                   |
             V                                   |
           +---------------------------------------+
           |                libnvmm                |
           +---------------------------------------+
                |   ^          |               ^
      (1) State |   | (2) No   | (3) Ioctl:    | (5) Ok, state
        cached? |   |          | "please cache | fetched
                |   |          |  the state"   |
                V   |          |               |
            +-----------+      |               |
            | Comm Page |------+---------------+
            +-----------+      |
                  ^            |
     (4) "Alright |            V
            babe" |     +--------+
                  +-----| Kernel |
                        +--------+

The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls that libnvmm or the emulator performs will use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g., on a 4-CPU Intel with 4 VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
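The wanted/cached/commit bookkeeping from the 1.43 entry can be modelled in a few lines. The following is a simplified user-space sketch under stated assumptions, not the libnvmm code: `comm_page_sketch`, the `ST_*` flags, `fake_getstate_ioctl()` and the `sketch_*` helpers are all hypothetical names, the actual register contents are omitted, and the "ioctl" is a counter rather than a real syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state-group flags; the real NVMM names differ. */
#define ST_GPRS	0x1u
#define ST_SEGS	0x2u
#define ST_CRS	0x4u
#define ST_MSRS	0x8u

/* Hypothetical model of the three comm page flags (state itself omitted). */
struct comm_page_sketch {
	uint64_t wanted;	/* states the kernel must fetch on ioctl */
	uint64_t cached;	/* states currently valid in the comm page */
	uint64_t commit;	/* states the kernel must install in vcpu_run */
};

static unsigned state_ioctls;	/* counts the simulated "expensive" syscalls */

/* Stand-in for the kernel side of the getstate ioctl (steps 3 and 4). */
static void
fake_getstate_ioctl(struct comm_page_sketch *comm)
{
	state_ioctls++;
	comm->cached |= comm->wanted;
	comm->wanted = 0;
}

/* getstate: go to the kernel only for states missing from the comm page. */
static void
sketch_getstate(struct comm_page_sketch *comm, uint64_t flags)
{
	assert(comm != NULL);
	comm->wanted = flags & ~comm->cached;
	if (comm->wanted != 0)
		fake_getstate_ioctl(comm);
	/* the caller would now copy the state out of the comm page */
}

/* setstate: no syscall, just mark the states as cached and to-be-committed. */
static void
sketch_setstate(struct comm_page_sketch *comm, uint64_t flags)
{
	assert(comm != NULL);
	comm->cached |= flags;
	comm->commit |= flags;
}

/* run: the kernel installs the committed states, runs the guest, and on
 * exit may speculatively cache the states it expects userland to want. */
static void
sketch_run(struct comm_page_sketch *comm, uint64_t speculative)
{
	assert(comm != NULL);
	comm->commit = 0;
	comm->cached = speculative;
}
```

In this model a second `sketch_getstate()` for an already cached group leaves `state_ioctls` unchanged, which is the "direct 1->5" path of the diagram.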
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
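The generation-number idea from the 1.29 entry can be sketched with C11 atomics. This is an illustration only, assuming a single counter per VM and a plain copy per VCPU: `machine_sketch`, `vcpu_sketch`, `machine_tlb_flush()` and `vcpu_sync_tlb()` are hypothetical names, the broadcast IPI is reduced to a comment, and the Intel-specific kcpuset tracking is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU bookkeeping. */
struct machine_sketch {
	_Atomic uint64_t tlb_gen;	/* bumped on every requested flush */
};

struct vcpu_sketch {
	uint64_t seen_gen;		/* last generation this VCPU applied */
	unsigned flushes;		/* how many flushes were really done */
};

/* Requester side: never blocks, never takes a VCPU mutex. */
static void
machine_tlb_flush(struct machine_sketch *mach)
{
	assert(mach != NULL);
	atomic_fetch_add(&mach->tlb_gen, 1);
	/* the real driver would now send a broadcast IPI to force #VMEXITs */
}

/* VCPU loop side: called before re-entering the guest. Returns true if
 * a flush had to be applied to catch up with the per-VM generation. */
static bool
vcpu_sync_tlb(struct machine_sketch *mach, struct vcpu_sketch *vcpu)
{
	uint64_t gen;

	assert(mach != NULL && vcpu != NULL);
	gen = atomic_load(&mach->tlb_gen);
	if (vcpu->seen_gen == gen)
		return false;		/* already up to date */
	vcpu->seen_gen = gen;
	vcpu->flushes++;		/* stand-in for the hardware flush */
	return true;
}
```

Several flush requests that arrive before a VCPU re-enters the guest collapse into one flush in this model, which is where the laziness comes from.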
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
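The software fallback described in the 1.8 entry amounts to scanning the instruction's legacy prefixes for a segment override. Below is a minimal sketch of that scan, not the driver's decoder: `seg_sketch` and `outs_segment()` are hypothetical names, the sketch lets the last segment prefix win, and it ignores mode-specific rules (for instance, how 64-bit mode treats the ES/CS/SS/DS overrides). The prefix byte values themselves are the standard x86 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment identifiers, in the CPU encoding order. */
enum seg_sketch { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS };

/* Return the segment an OUTS instruction reads from, given its raw bytes. */
static enum seg_sketch
outs_segment(const uint8_t *insn, size_t len)
{
	enum seg_sketch seg = SEG_DS;	/* OUTS defaults to DS:rSI */

	assert(insn != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: seg = SEG_ES; break;
		case 0x2E: seg = SEG_CS; break;
		case 0x36: seg = SEG_SS; break;
		case 0x3E: seg = SEG_DS; break;
		case 0x64: seg = SEG_FS; break;
		case 0x65: seg = SEG_GS; break;
		case 0x66:		/* operand-size override */
		case 0x67:		/* address-size override */
		case 0xF0:		/* LOCK */
		case 0xF2:		/* REPNE */
		case 0xF3:		/* REP */
			break;		/* other legacy prefixes: keep scanning */
		default:
			return seg;	/* REX or opcode: prefixes are over */
		}
	}
	return seg;
}
```

For example, the byte sequence `65 F3 6F` (GS override, REP, OUTSW/OUTSD) yields `SEG_GS`, whereas a bare `F3 6E` (REP OUTSB) keeps the default `SEG_DS`.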
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.65	19-Jul-2020	maxv	Switch to fpu_kern_enter/leave, to prevent clobbering, now that the kernel itself uses the fpu.
# 1.64	19-Jul-2020	maxv	The TLB flush IPIs do not respect the IPL, so enforcing IPL_HIGH has no effect. Disable interrupts earlier instead. This prevents a possible race against such IPIs.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.63	03-Jul-2020	maxv	Print the backend name when attaching.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
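A minimal sketch of the convention mentioned in 1.61, where the base hypervisor leaf reports the highest hypervisor leaf in EAX. The struct, the function name, the `HV_LEAF_MAX` value and the 12-byte signature string are all hypothetical placeholders, not taken from the NVMM source:

```c
#include <stdint.h>
#include <string.h>

#define HV_LEAF_BASE	0x40000000u
#define HV_LEAF_MAX	0x40000000u	/* hypothetical: only the base leaf is implemented */

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Fill the registers for leaf 0x40000000. */
static void
hv_cpuid_base(struct cpuid_regs *r)
{
	/* Placeholder 12-byte signature, split over EBX/ECX/EDX. */
	static const char sig[] = "HypSketchSig";

	r->eax = HV_LEAF_MAX;		/* highest hypervisor leaf */
	memcpy(&r->ebx, sig + 0, 4);
	memcpy(&r->ecx, sig + 4, 4);
	memcpy(&r->edx, sig + 8, 4);
}
```

A guest that probes the hypervisor range would read EAX of the base leaf first and only then query further leaves up to that value.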
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
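A small sketch of how the bit-fields listed in 1.51 could select the per-leaf behavior. The bit-field layout is copied from the commit message; the struct name, the enum, and the decision to treat "both bits set" as invalid are assumptions made for illustration:

```c
#include <stdint.h>

/* Bit-fields as listed in the commit message; the struct name is hypothetical. */
struct cpuid_conf_sketch {
	uint32_t mask:1;
	uint32_t exit:1;
	uint32_t rsvd:30;
};

enum leaf_behavior {
	LEAF_DEFAULT,	/* both bits zero: reset to default handling */
	LEAF_MASK,	/* mask the leaf */
	LEAF_EXIT,	/* trigger a VMEXIT on the leaf */
	LEAF_INVALID	/* assumption: mask and exit are mutually exclusive */
};

static enum leaf_behavior
cpuid_conf_behavior(const struct cpuid_conf_sketch *c)
{
	if (c->mask && c->exit)
		return LEAF_INVALID;
	if (c->mask)
		return LEAF_MASK;
	if (c->exit)
		return LEAF_EXIT;
	return LEAF_DEFAULT;
}
```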
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it with the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and emulators should now call nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs) instead of nvmm_callbacks_register(&cbs). This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
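The three conf ranges from 1.46 can be sketched as a simple classifier. The commit message leaves the exact boundaries ambiguous (100 and 200 appear in two ranges each), so treating each range as half-open is an assumption, and the names below are hypothetical:

```c
#include <stdint.h>

#define CONF_MI_BEGIN	100	/* assumed start of the MI range */
#define CONF_MD_BEGIN	200	/* assumed start of the MD range */

enum conf_owner {
	CONF_LIBNVMM,	/* handled in userland by libnvmm */
	CONF_MI,	/* machine-independent kernel frontend */
	CONF_MD		/* machine-dependent backend */
};

/* Decide which layer handles a given conf op number. */
static enum conf_owner
conf_classify(uint64_t op)
{
	if (op < CONF_MI_BEGIN)
		return CONF_LIBNVMM;
	if (op < CONF_MD_BEGIN)
		return CONF_MI;
	return CONF_MD;
}
```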
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall. [Diagram of the flow between Qemu, libnvmm, the comm page and the kernel:] (0) Qemu calls nvmm_vcpu_getstate in libnvmm; (1) libnvmm checks the comm page: "State cached?"; (2) "No"; (3) libnvmm issues an ioctl to the kernel: "please cache the state"; (4) the kernel replies "Alright babe" through the comm page; (5) "Ok, state fetched" reaches libnvmm; (6) "Done" is returned to Qemu. The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
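The "wanted"/"cached"/"commit" logic from 1.43 can be sketched in userland C as follows. This is an illustration under stated assumptions, not the libnvmm implementation: the struct, the flag values and the function names are hypothetical, and a counter stands in for the real ioctl:

```c
#include <stdint.h>

#define STATE_GPRS	0x1u	/* hypothetical state-set flags */
#define STATE_SEGS	0x2u

/* Hypothetical comm page holding only the three flag words. */
struct comm_sketch {
	uint64_t state_wanted;
	uint64_t state_cached;
	uint64_t state_commit;
};

static unsigned fake_ioctl_count;	/* stands in for real syscalls */

/* Stand-in for the kernel side: caches whatever is "wanted". */
static void
fake_getstate_ioctl(struct comm_sketch *comm)
{
	fake_ioctl_count++;
	comm->state_cached |= comm->state_wanted;
}

/* getstate: go to the kernel only when some requested state is not cached. */
static void
sketch_getstate(struct comm_sketch *comm, uint64_t flags)
{
	if ((comm->state_cached & flags) == flags)
		return;		/* direct 1->5, no syscall */
	comm->state_wanted = flags & ~comm->state_cached;
	fake_getstate_ioctl(comm);
}

/* setstate: no syscall; mark the state cached and to be committed on run. */
static void
sketch_setstate(struct comm_sketch *comm, uint64_t flags)
{
	comm->state_cached |= flags;
	comm->state_commit |= flags;
}
```

In this sketch a second `sketch_getstate()` for the same flags returns immediately, which mirrors the syscall saving the commit message describes.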
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
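A minimal sketch of the generation-number scheme described in 1.29, assuming C11 atomics in place of kernel primitives. The broadcast IPI and the Intel kcpuset bookkeeping are omitted, and every name here is hypothetical rather than taken from the NVMM source:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state: bumped on every requested TLB flush. */
struct vm_sketch {
	_Atomic uint64_t tlb_gen;
};

/* Hypothetical per-VCPU state: last generation this VCPU has applied. */
struct vcpu_sketch {
	uint64_t tlb_gen_seen;
	unsigned flushes;	/* counts applied flushes, for illustration */
};

/* Requester side: bump the generation (the driver would then send IPIs). */
static void
vm_request_tlb_flush(struct vm_sketch *vm)
{
	atomic_fetch_add(&vm->tlb_gen, 1);
}

/*
 * VCPU loop side: compare the per-VM generation with the per-VCPU copy,
 * flush lazily if stale, then catch up. Returns true if a flush was applied.
 */
static bool
vcpu_sync_tlb(struct vm_sketch *vm, struct vcpu_sketch *vcpu)
{
	uint64_t gen = atomic_load(&vm->tlb_gen);

	if (vcpu->tlb_gen_seen == gen)
		return false;
	vcpu->flushes++;	/* a real backend would flush the TLB here */
	vcpu->tlb_gen_seen = gen;
	return true;
}
```

Neither side takes a lock or sleeps, and several flush requests that arrive before the VCPU next synchronizes collapse into a single flush.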
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
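The OSXSAVE rule in 1.25 amounts to mirroring a guest CR4 bit into the CPUID leaf 1 ECX output. A sketch, using the architectural bit positions (CR4 bit 18, CPUID.1:ECX bit 27); the function name is hypothetical and the macro values are defined locally rather than pulled from kernel headers:

```c
#include <stdint.h>

#define CPUID2_OSXSAVE	(1u << 27)	/* CPUID.1:ECX.OSXSAVE */
#define CR4_OSXSAVE	(1ull << 18)	/* CR4.OSXSAVE */

/* Report OSXSAVE to the guest only if the guest itself enabled it in CR4. */
static uint32_t
filter_cpuid1_ecx(uint32_t ecx, uint64_t guest_cr4)
{
	ecx &= ~CPUID2_OSXSAVE;
	if (guest_cr4 & CR4_OSXSAVE)
		ecx |= CPUID2_OSXSAVE;
	return ecx;
}
```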
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
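The large-page fix listed under Libnvmm in 1.10 ("we need to add the lower bits too") can be illustrated like this. It is a sketch of the general x86-64 arithmetic, not the libnvmm code; the function name and the address mask macro are hypothetical:

```c
#include <stdint.h>

#define PTE_ADDR_BITS	0x000fffffffffffffULL	/* drop NX and the high ignored bits */

/*
 * Combine a leaf page-table entry with the guest virtual address.
 * For a large page, the low bits of the GVA (below the page size) are the
 * offset inside the page and must be added to the frame address.
 * pagesize must be a power of two (4KB, 2MB or 1GB).
 */
static uint64_t
leaf_to_gpa(uint64_t pte, uint64_t gva, uint64_t pagesize)
{
	uint64_t offmask = pagesize - 1;
	uint64_t frame = pte & PTE_ADDR_BITS & ~offmask;

	return frame + (gva & offmask);
}
```

Returning only `frame` would work for addresses at the start of a large page and silently point at the wrong location for every other address, which is the class of bug the entry describes.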
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
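A sketch of the software fallback described in 1.8: scan the legacy prefixes of the fetched instruction and pick the segment for OUTS, defaulting to DS. The prefix byte values are the architectural ones; the enum and function name are hypothetical, and REX prefixes and the opcode itself are not decoded here:

```c
#include <stddef.h>
#include <stdint.h>

enum seg_sketch { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS };

/*
 * Walk the legacy prefixes at the start of an instruction and return the
 * segment an OUTS would read from. The last segment override seen wins;
 * scanning stops at the first byte that is not a legacy prefix.
 */
static enum seg_sketch
outs_source_segment(const uint8_t *insn, size_t len)
{
	enum seg_sketch seg = SEG_DS;	/* default source segment */

	for (size_t i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: seg = SEG_ES; break;
		case 0x2E: seg = SEG_CS; break;
		case 0x36: seg = SEG_SS; break;
		case 0x3E: seg = SEG_DS; break;
		case 0x64: seg = SEG_FS; break;
		case 0x65: seg = SEG_GS; break;
		case 0x66: case 0x67:		/* operand/address size */
		case 0xF0: case 0xF2: case 0xF3:	/* LOCK, REPNE, REP */
			break;
		default:
			return seg;	/* reached REX or the opcode */
		}
	}
	return seg;
}
```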
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.62	24-May-2020	maxv	Gather the conditions to return from the VCPU loops in nvmm_return_needed(), and use it in nvmm_do_vcpu_run() as well. This fixes two undesired behaviors: - When a VM initializes, the many nested page faults that need processing could cause the calling thread to occupy the CPU too much if we're unlucky and are only getting repeated nested page faults thousands of times in a row. - When the emulator calls nvmm_vcpu_run() and immediately sends a signal to stop the VCPU, it's better to check signals earlier and leave right away, rather than doing a round of VCPU run that could increase the time spent by the emulator waiting for the return.
# 1.61	10-May-2020	maxv	Respect the convention for the hypervisor information: return the highest hypervisor leaf in 0x40000000.EAX.
# 1.60	09-May-2020	maxv	Improve the CPUID emulation of basic leaves: - Hide DCA and PQM, they cannot be used in guests. - On Intel, explicitly handle each basic leaf until 0x16. - On AMD, explicitly handle each basic leaf until 0x0D.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do:
- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:
[Diagram: nvmm_vcpu_getstate flow between Qemu, libnvmm, the comm page and the kernel]
(0) Qemu -> libnvmm: nvmm_vcpu_getstate
(1) libnvmm -> comm page: State cached?
(2) comm page -> libnvmm: No
(3) libnvmm -> kernel: Ioctl: "please cache the state"
(4) kernel: "Alright babe" (the kernel fills the comm page)
(5) comm page -> libnvmm: Ok, state fetched
(6) libnvmm -> Qemu: Done
The main changes in behavior are:
- nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g., on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
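The "wanted"/"cached"/"commit" flags of the 1.43 entry can be sketched roughly as follows. This is a userland toy model under stated assumptions: the struct layout, the ST_* bit values and every sketch_* function are hypothetical, and the kernel side is simulated by a counter instead of a real ioctl.

```c
#include <stdint.h>

/* Hypothetical state-group bits; the real NVMM values may differ. */
#define ST_GPRS	0x1u
#define ST_SEGS	0x2u
#define ST_CRS	0x4u
#define ST_MSRS	0x8u

/* Hypothetical comm page holding only the three flag words. */
struct sketch_comm_page {
	uint64_t wanted;	/* states the kernel must get/set via ioctl */
	uint64_t cached;	/* states currently valid in the comm page */
	uint64_t commit;	/* states the kernel must set in vcpu_run */
};

static unsigned sketch_syscalls;	/* counts simulated ioctls */

/* Simulated kernel side of the getstate ioctl: cache what is wanted. */
static void
sketch_ioctl_getstate(struct sketch_comm_page *cp)
{
	sketch_syscalls++;
	cp->cached |= cp->wanted;
	cp->wanted = 0;
}

/* getstate: no syscall when everything requested is already cached. */
static void
sketch_getstate(struct sketch_comm_page *cp, uint64_t flags)
{
	if ((cp->cached & flags) == flags)
		return;			/* direct 1->5 path */
	cp->wanted = flags & ~cp->cached;
	sketch_ioctl_getstate(cp);
}

/* setstate: never a syscall, only marks the state for a later commit. */
static void
sketch_setstate(struct sketch_comm_page *cp, uint64_t flags)
{
	cp->cached |= flags;
	cp->commit |= flags;
}

/* run: returns the states the kernel would commit, then clears them. */
static uint64_t
sketch_run(struct sketch_comm_page *cp)
{
	uint64_t committed = cp->commit;

	cp->commit = 0;
	return committed;
}
```

In this model, a second sketch_getstate() for an already cached group costs no simulated syscall, which is the saving the log entry describes.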
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
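The generation-number scheme of the 1.29 entry can be sketched as below. All names are hypothetical, the sketch is single-threaded, and the broadcast IPI is only noted in a comment; a real lock-less implementation would need atomic accesses to the shared counter, which are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU bookkeeping. */
struct sketch_vm {
	uint64_t tlb_gen;	/* bumped on every TLB flush request */
};

struct sketch_vcpu {
	uint64_t tlb_gen;	/* last per-VM generation this VCPU saw */
	unsigned flushes;	/* flushes actually applied (for the demo) */
};

/* Request a flush: bump the per-VM generation. The real driver would
 * then send a broadcast IPI so that every VCPU takes a #VMEXIT. */
static void
sketch_vm_tlb_flush(struct sketch_vm *vm)
{
	vm->tlb_gen++;
}

/* Called from the VCPU loop: flush lazily, only if the per-VCPU copy
 * is behind the per-VM generation. Returns true if a flush happened. */
static bool
sketch_vcpu_tlb_sync(const struct sketch_vm *vm, struct sketch_vcpu *vcpu)
{
	if (vcpu->tlb_gen == vm->tlb_gen)
		return false;
	vcpu->tlb_gen = vm->tlb_gen;
	vcpu->flushes++;
	return true;
}
```

One consequence visible in this model is that several flush requests issued before a VCPU re-enters its loop collapse into a single flush on that VCPU, and no mutex is taken on either side.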
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place: it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM
* Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory.
* Hide RDTSCP.
* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.
* Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize:
* Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely).
* Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS.
* Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM.
Kernel driver:
* Don't take an extra (unneeded) reference to the UAO.
* Provide npc for HLT. I'm not really happy with it right now, will likely be revisited.
* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too.
* Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does).
* Provide a hypervisor signature in CPUID, and hide SVM.
* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.
* If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later.
Libnvmm:
* Fix the MMU translation of large pages, we need to add the lower bits too.
* Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility.
* Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance.
* Decode MOVZX.
With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install
    /usr/include/dev/nvmm/nvmm.h
    /usr/include/dev/nvmm/nvmm_ioctl.h
    /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h
And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
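The large-page fix mentioned in revision 1.10 ("we need to add the lower bits too") amounts to the arithmetic below. It is a sketch under assumptions: `large_page_gpa()` is a hypothetical helper, and the page size is passed in explicitly rather than derived from a page-table walk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a guest virtual address that is mapped by a large page.
 * 'frame' is the physical base taken from the page-directory entry and
 * 'page_size' is the size of the mapping (for example 2 MiB).
 * Returning only the frame would drop the offset inside the large page,
 * so the low bits of the virtual address are added back.
 */
static uint64_t
large_page_gpa(uint64_t frame, uint64_t gva, uint64_t page_size)
{
	uint64_t off_mask;

	/* The mask trick below only works for power-of-two sizes. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	off_mask = page_size - 1;
	return (frame & ~off_mask) | (gva & off_mask);
}
```

For a 2 MiB page the offset mask is 0x1FFFFF, so 21 low bits of the virtual address survive into the result instead of the 12 kept for a 4 KiB page.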
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
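The software decoding described in revision 1.8 can be pictured as a prefix scan. The sketch below is an assumption-laden illustration, not the driver's decoder: `outs_segment()` is a hypothetical name, only legacy prefixes are handled (no REX), CPU-mode rules about ignored overrides are left out, and the last segment prefix seen is taken as the effective one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Segment registers in CPU encoding order. */
enum seg { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS };

/*
 * Walk the legacy prefixes in front of an OUTS instruction and return
 * the segment used for the source operand. DS is the default when no
 * segment-override prefix is present.
 */
static enum seg
outs_segment(const uint8_t *insn, size_t len)
{
	enum seg seg = SEG_DS;
	size_t i;

	assert(insn != NULL);
	for (i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: seg = SEG_ES; break;
		case 0x2E: seg = SEG_CS; break;
		case 0x36: seg = SEG_SS; break;
		case 0x3E: seg = SEG_DS; break;
		case 0x64: seg = SEG_FS; break;
		case 0x65: seg = SEG_GS; break;
		case 0x66:	/* operand-size override */
		case 0x67:	/* address-size override */
		case 0xF0:	/* LOCK */
		case 0xF2:	/* REPNE */
		case 0xF3:	/* REP */
			break;
		default:
			/* First non-prefix byte: the opcode starts here. */
			return seg;
		}
	}
	return seg;
}
```

For instance, the byte sequence F3 64 6E (REP, FS override, OUTSB) yields SEG_FS, whereas assuming DS unconditionally would read from the wrong segment base.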
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.59	30-Apr-2020	maxv	When the identification fails, print the reason.
Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base phil-wifi-20200406
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: is-mlppp-base ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
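The two rules from revision 1.56 look like this in practice. The names (`enum kind`, `kind_from_cookie()`, `has_cookie()`) are hypothetical and only illustrate the casting pattern; they do not come from the NVMM sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum kind { KIND_NONE = 0, KIND_A = 1, KIND_B = 2 };

/*
 * A pointer and an enum are not necessarily the same size, so go
 * through uintptr_t instead of casting the pointer to the enum directly.
 */
static enum kind
kind_from_cookie(const void *cookie)
{
	return (enum kind)(uintptr_t)cookie;
}

/* Compare against NULL explicitly instead of casting a pointer to bool. */
static bool
has_cookie(const void *cookie)
{
	return cookie != NULL;
}
```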
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall. The steps are: (0) Qemu -> libnvmm: nvmm_vcpu_getstate; (1) libnvmm -> Comm Page: "State cached?"; (2) Comm Page -> libnvmm: "No"; (3) libnvmm -> Kernel: Ioctl "please cache the state"; (4) Kernel -> Comm Page: "Alright babe"; (5) Comm Page -> libnvmm: "Ok, state fetched"; (6) libnvmm -> Qemu: "Done". The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g. on a 4-CPU Intel with 4 VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
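The interplay of the "wanted", "cached" and "commit" flags from revision 1.43 can be sketched in userland C. Everything here is a simplified model: the `STATE_*` bits, `struct comm_page`, `fake_kernel_getstate()` and the `comm_*()` helpers are hypothetical, the register data itself is left out, and clearing the cache in `comm_run()` is an assumption that ignores the speculative caching the log describes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state-set bits. */
#define STATE_GPRS	0x1u
#define STATE_SEGS	0x2u
#define STATE_CRS	0x4u

/* Hypothetical comm page: the three flag words, register data omitted. */
struct comm_page {
	uint64_t state_wanted;
	uint64_t state_cached;
	uint64_t state_commit;
};

/* Counts how often the (simulated) syscall path is taken. */
static unsigned int fake_ioctl_calls;

/* Stand-in for the ioctl: the kernel fills in whatever is wanted. */
static void
fake_kernel_getstate(struct comm_page *comm)
{
	fake_ioctl_calls++;
	comm->state_cached |= comm->state_wanted;
	comm->state_wanted = 0;
}

/* getstate: only enter the kernel for states missing from the cache. */
static void
comm_getstate(struct comm_page *comm, uint64_t flags)
{
	if ((comm->state_cached & flags) != flags) {
		comm->state_wanted = flags & ~comm->state_cached;
		fake_kernel_getstate(comm);
	}
}

/* setstate: no syscall, just mark the states as cached and to-commit. */
static void
comm_setstate(struct comm_page *comm, uint64_t flags)
{
	comm->state_cached |= flags;
	comm->state_commit |= flags;
}

/* run: the kernel commits the pending states before entering the guest. */
static void
comm_run(struct comm_page *comm)
{
	comm->state_commit = 0;
	/* Assumption: once the guest has run, cached values are stale. */
	comm->state_cached = 0;
}
```

A second `comm_getstate()` for an already cached set returns without touching the simulated ioctl path, which corresponds to the direct 1->5 shortcut in the log.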
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
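The allowlist approach from revision 1.33 can be illustrated on CPUID leaf 1, register ECX. The bit positions are the architectural ones, but the macro names, the particular set of allowed bits and `cpuid1_ecx_filter()` are choices made for this sketch and not the driver's actual masks.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions in CPUID.01H:ECX; names are illustrative. */
#define F_SSE3		(1u << 0)
#define F_SSSE3		(1u << 9)
#define F_SSE4_1	(1u << 19)
#define F_SSE4_2	(1u << 20)
#define F_AVX		(1u << 28)

/* Hypothetical allowlist: only features the guest state can back. */
#define ECX_ALLOWED	(F_SSE3 | F_SSSE3 | F_SSE4_1 | F_SSE4_2)

/*
 * Keep only known-good bits. Anything not listed, including bits that
 * future CPUs may define, is hidden from the guest by construction.
 */
static uint32_t
cpuid1_ecx_filter(uint32_t host_ecx)
{
	return host_ecx & ECX_ALLOWED;
}
```

A denylist would have to be updated for every new feature bit, while the allowlist stays safe by default. AVX is absent from the list here because, as the log notes, the guest state provided is only x86+SSE.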
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
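The generation-number scheme from revision 1.29 can be modeled with C11 atomics. This is a userland sketch with hypothetical names (`struct machine`, `struct vcpu`, `machine_tlb_flush()`, `vcpu_sync_tlb()`); the broadcast IPI, the kcpuset bookkeeping on Intel and the actual hardware flush are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-VM state: a generation counter bumped on every flush request. */
struct machine {
	_Atomic uint64_t tlb_gen;
};

/* Per-VCPU state: the last generation this VCPU has acted on. */
struct vcpu {
	uint64_t tlb_gen_seen;
	unsigned int flushes;	/* counts simulated hardware flushes */
};

/* Requester side: no mutex, no sleeping, just bump the generation. */
static void
machine_tlb_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
	/* The real driver would now send a broadcast IPI. */
}

/*
 * VCPU loop side: compare the per-VM generation with the private copy
 * and flush lazily, once, no matter how many requests were made.
 */
static bool
vcpu_sync_tlb(struct vcpu *v, struct machine *m)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen_seen == gen)
		return false;
	v->tlb_gen_seen = gen;
	v->flushes++;
	return true;
}
```

Several flush requests that arrive before a VCPU reenters its loop collapse into a single flush, and neither side ever blocks.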
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
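The OSXSAVE rule from revision 1.25 mirrors a guest control-register bit into a CPUID bit. The bit positions below are architectural (CR4.OSXSAVE is bit 18, CPUID.01H:ECX.OSXSAVE is bit 27); the macro names and `cpuid1_ecx_osxsave()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_CR4_OSXSAVE	(1ULL << 18)	/* CR4.OSXSAVE */
#define CPUID1_ECX_OSXSAVE	(1u << 27)	/* CPUID.01H:ECX.OSXSAVE */

/*
 * The host's own OSXSAVE bit says nothing about the guest, so drop it
 * and report OSXSAVE only if the guest has enabled it in its CR4.
 */
static uint32_t
cpuid1_ecx_osxsave(uint32_t ecx, uint64_t guest_cr4)
{
	ecx &= ~CPUID1_ECX_OSXSAVE;
	if (guest_cr4 & GUEST_CR4_OSXSAVE)
		ecx |= CPUID1_ECX_OSXSAVE;
	return ecx;
}
```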
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
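One common way to handle a guest write to MSR_TSC, as in revision 1.24, is to keep an offset relative to the host TSC. The sketch below shows that arithmetic only; whether the driver computes the offset exactly this way is an assumption, and `tsc_offset_for_write()` and `guest_tsc_read()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset that makes the guest observe 'guest_value' at the moment the
 * host TSC reads 'host_tsc'. Unsigned wraparound is intentional: a
 * guest value below the host value gives a "negative" offset.
 */
static uint64_t
tsc_offset_for_write(uint64_t guest_value, uint64_t host_tsc)
{
	return guest_value - host_tsc;
}

/* What the guest reads later: host TSC plus the stored offset. */
static uint64_t
guest_tsc_read(uint64_t host_tsc, uint64_t offset)
{
	return host_tsc + offset;
}
```

Initializing the guest TSC to zero at VCPU creation is then the special case `tsc_offset_for_write(0, host_tsc)`. Host cycles spent between sampling the host TSC and reentering the guest are not compensated in this model, which is one way such a scheme ends up imprecise.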
# 1.58	22-Mar-2020	ad	x86 pmap: - Give pmap_remove_all() its own version of pmap_remove_ptes() that on native x86 does the bare minimum needed to clear out PTPs. Cuts ~4% sys time on 'build.sh release' for me. - pmap_sync_pv(): there's no need to issue a redundant TLB shootdown. The caller waits for the competing operation to finish. - Bring 'options TLBSTATS' up to date.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
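The caching rule described in revision 1.43 can be illustrated with a small C sketch. This is not the NVMM or libnvmm code: the struct, the flag names, and the functions `fake_kernel_cache()`, `getstate()` and `setstate()` are all hypothetical, and the ioctl is replaced by a plain function that counts how often it is called.

```c
#include <stdint.h>

/* Hypothetical state-class bits; not the real NVMM flag names. */
#define STATE_GPRS 0x1u
#define STATE_SEGS 0x2u
#define STATE_CRS  0x4u

/* Hypothetical stand-in for the page shared with the kernel. */
struct comm_page {
	uint64_t wanted;	/* states the kernel must fetch on request */
	uint64_t cached;	/* states currently valid in the page */
	uint64_t commit;	/* states the kernel must load in vcpu_run */
};

static unsigned syscalls;	/* number of simulated ioctls */

/* Stand-in for the ioctl: the "kernel" caches what was asked for. */
static void
fake_kernel_cache(struct comm_page *cp)
{
	syscalls++;
	cp->cached |= cp->wanted;
	cp->wanted = 0;
}

/* getstate: go to the kernel only for states that are not cached yet. */
static void
getstate(struct comm_page *cp, uint64_t flags)
{
	uint64_t missing = flags & ~cp->cached;

	if (missing != 0) {
		cp->wanted = missing;
		fake_kernel_cache(cp);
	}
	/* A real caller would now copy the state out of the page. */
}

/* setstate: no syscall, only mark the state for commit at run time. */
static void
setstate(struct comm_page *cp, uint64_t flags)
{
	cp->cached |= flags;
	cp->commit |= flags;
}
```

In this model, a second `getstate()` for an already cached state class costs no simulated syscall, which is the "direct 1->5" path from the commit message.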
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
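The generation-number scheme from revision 1.29 can be sketched in C as follows. This is an illustration under assumptions, not the driver code: `struct machine`, `struct vcpu`, `machine_tlb_flush()` and `vcpu_sync_tlb()` are hypothetical names, C11 atomics stand in for the kernel's own primitives, and the broadcast IPI and the hardware flush are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state: one shared generation counter. */
struct machine {
	_Atomic uint64_t tlb_gen;
};

/* Hypothetical per-VCPU state: the last generation this VCPU has seen. */
struct vcpu {
	uint64_t tlb_gen;
};

/*
 * Request a flush for the whole VM. Only the counter is bumped, so no
 * mutex is taken and nothing can sleep. The real driver would follow
 * this with a broadcast IPI to force a #VMEXIT on every VCPU.
 */
static void
machine_tlb_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
}

/*
 * Called by a VCPU loop before entering the guest. Returns true when
 * the VCPU is behind the VM generation and must flush before running.
 */
static bool
vcpu_sync_tlb(struct machine *m, struct vcpu *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen == gen)
		return false;
	v->tlb_gen = gen;
	return true;
}
```

Several flush requests that arrive before a VCPU next synchronizes collapse into a single flush on that VCPU, which is what makes the scheme lazy.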
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.57	14-Mar-2020	ad	- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small functions: preempt_point() and preempt_needed(). - preempt(): if the LWP has exceeded its timeslice in kernel, strip it of any priority boost gained earlier from blocking.
Revision tags: ad-namecache-base3
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	branches: 1.55.2; pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall. [Diagram: Qemu on top, libnvmm below it, then the Comm Page and the Kernel.] The steps in the diagram are: (0) Qemu calls nvmm_vcpu_getstate on libnvmm; (1) libnvmm asks the comm page "State cached?"; (2) the answer is "No"; (3) libnvmm issues an ioctl to the kernel: "please cache the state"; (4) the kernel acknowledges and fills the comm page; (5) libnvmm gets "Ok, state fetched"; (6) libnvmm returns "Done" to Qemu. The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls.
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.56	21-Feb-2020	joerg	Explicitly cast pointers to uintptr_t before casting to enums. They are not necessarily the same size. Don't cast pointers to bool, check for NULL instead.
Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.55	10-Dec-2019	ad	pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API.
 - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use.
 - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open().
 - Rename the following things for consistency:
       nvmm_exit* -> nvmm_vcpu_exit*
       nvmm_event* -> nvmm_vcpu_event*
       NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
       NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
       NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
   Delete NVMM_EVENT_INTERRUPT_SW, unused already.
 - Slightly reorganize the MI/MD definitions, for internal clarity.
 - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons.
 - Change the types of several variables:
       event.type: enum -> u_int
       event.vector: uint64_t -> uint8_t
       exit.u.*msr.msr: uint64_t -> uint32_t
       exit.u.io.type: enum -> bool
       exit.u.io.seg: int -> int8_t
       cap.arch.mxcsr_mask: uint64_t -> uint32_t
       cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
 - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed.
 - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs.
 - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM.
 - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. A uint32_t is added in the structure:
       uint32_t mask:1;
       uint32_t exit:1;
       uint32_t rsvd:30;
   The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
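The per-leaf selection described in 1.51 can be sketched as below. This is only an illustration, not the driver's code: the struct name, the `leaf` field, the enum and `classify_leaf()` are hypothetical, and rejecting a conf that sets both bits is an assumption (the log only says that zero on both restores the default).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-leaf CPUID conf. Only the three
 * bitfields come from the commit message; everything else is illustrative.
 */
struct cpuid_conf_sketch {
	uint32_t leaf;		/* CPUID leaf the entry applies to (assumed) */
	uint32_t mask:1;	/* filter the values returned for the leaf */
	uint32_t exit:1;	/* trigger a VMEXIT when the leaf is queried */
	uint32_t rsvd:30;	/* reserved, expected to be zero */
};

enum leaf_action { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT, LEAF_INVALID };

/* Map the two selector bits to the behavior requested for the leaf. */
static enum leaf_action
classify_leaf(const struct cpuid_conf_sketch *c)
{
	if (c->rsvd != 0)
		return LEAF_INVALID;
	if (c->mask && c->exit)
		return LEAF_INVALID;	/* assumption: mutually exclusive */
	if (c->mask)
		return LEAF_MASK;
	if (c->exit)
		return LEAF_EXIT;
	return LEAF_DEFAULT;		/* both zero: reset to default */
}
```

A caller would fill one such entry per leaf and hand it to the per-VCPU configuration call; with both bits clear the leaf falls back to whatever the kernel does by default.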
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do:
	- nvmm_callbacks_register(&cbs);
	+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

          +---------------------------------------------+
          |                    Qemu                     |
          +---------------------------------------------+
               |                                   ^
               | (0) nvmm_vcpu_getstate            | (6) Done
               |                                   |
               V                                   |
             +---------------------------------------+
             |                libnvmm                |
             +---------------------------------------+
                  |   ^           |               ^
        (1) State |   | (2) No    | (3) Ioctl:    | (5) Ok, state
          cached? |   |           | "please cache |     fetched
                  |   |           |  the state"   |
                  V   |           |               |
               +-----------+      |               |
               | Comm Page |------+---------------+
               +-----------+      |
                        ^         |
           (4) "Alright |         V
                babe"   |     +--------+
                        +-----| Kernel |
                              +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls that libnvmm or the emulator will perform use the comm page rather than expensive syscalls. For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland.

Overall, in a normal run of Windows 10, this saves several millions of syscalls. E.g. on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
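The getstate/setstate/run behavior described in 1.43 can be modeled with a small user-space sketch. It is an illustration under assumptions, not the libnvmm or kernel code: all type and function names are hypothetical, the "kernel" is a plain struct, the ioctl is a function that bumps a counter, and only two state classes exist. Clearing the cached flags on every run is also a simplification (the log says the kernel may re-cache some states speculatively).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state classes; the real driver has more of them. */
#define ST_GPRS	0x1u
#define ST_CRS	0x2u

struct state_sketch {
	uint64_t gprs[16];
	uint64_t crs[4];
};

/* The three flag words named in the commit message, plus the state. */
struct comm_page_sketch {
	struct state_sketch state;
	uint64_t wanted;	/* states the kernel must fetch on ioctl */
	uint64_t cached;	/* states currently valid in the page */
	uint64_t commit;	/* states the kernel must set in vcpu_run */
};

struct vcpu_sketch {
	struct comm_page_sketch comm;	/* shared page, modeled as memory */
	struct state_sketch hw;		/* stand-in for kernel-held state */
	unsigned nsyscalls;		/* counts simulated kernel entries */
};

/* Copy only the state classes selected by flags. */
static void
copy_state(struct state_sketch *dst, const struct state_sketch *src,
    uint64_t flags)
{
	if (flags & ST_GPRS)
		memcpy(dst->gprs, src->gprs, sizeof(dst->gprs));
	if (flags & ST_CRS)
		memcpy(dst->crs, src->crs, sizeof(dst->crs));
}

/* Simulated "please cache the state" ioctl: steps (3) and (4). */
static void
fake_getstate_ioctl(struct vcpu_sketch *v)
{
	v->nsyscalls++;
	copy_state(&v->comm.state, &v->hw, v->comm.wanted);
	v->comm.cached |= v->comm.wanted;
	v->comm.wanted = 0;
}

static void
sketch_getstate(struct vcpu_sketch *v, struct state_sketch *out,
    uint64_t flags)
{
	uint64_t missing = flags & ~v->comm.cached;

	if (missing != 0) {		/* (2) not cached: ask the kernel */
		v->comm.wanted = missing;
		fake_getstate_ioctl(v);
	}
	copy_state(out, &v->comm.state, flags);	/* (5) fetch from the page */
}

static void
sketch_setstate(struct vcpu_sketch *v, const struct state_sketch *in,
    uint64_t flags)
{
	copy_state(&v->comm.state, in, flags);	/* no kernel entry at all */
	v->comm.commit |= flags;
	v->comm.cached |= flags;
}

static void
sketch_run(struct vcpu_sketch *v)
{
	v->nsyscalls++;			/* the run ioctl itself */
	copy_state(&v->hw, &v->comm.state, v->comm.commit);
	v->comm.commit = 0;
	/* The guest would run here and change its state: invalidate. */
	v->comm.cached = 0;
}
```

In this model a second sketch_getstate() for the same class costs no simulated syscall, and a sketch_setstate() only reaches the "hardware" state on the next sketch_run().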
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
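The difference between the two filtering policies in 1.33 can be sketched as follows. The bit positions are the architectural CPUID leaf 1 ECX positions for SSE3, XSAVE and AVX; the two lists and the function names are hypothetical and far shorter than anything a real driver would use.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits (architectural bit positions). */
#define ECX_SSE3	(1u << 0)
#define ECX_XSAVE	(1u << 26)
#define ECX_AVX		(1u << 28)

/* Hypothetical allow-list: only features the guest state can back. */
#define ECX_ALLOWED	(ECX_SSE3 | ECX_XSAVE)
/* Hypothetical deny-list, the style used before this change. */
#define ECX_DENIED	(ECX_AVX)

/* Old style: strip known-bad bits, anything unknown leaks through. */
static uint32_t
filter_deny(uint32_t host_ecx)
{
	return host_ecx & ~ECX_DENIED;
}

/* New style: expose only known-good bits, unknown bits stay hidden. */
static uint32_t
filter_allow(uint32_t host_ecx)
{
	return host_ecx & ECX_ALLOWED;
}
```

With a host value that also sets a bit in neither list, filter_deny() passes that bit to the guest while filter_allow() hides it, which is the forward-compatibility point made in the commit message.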
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
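The generation-number scheme from 1.29 can be sketched in user space with C11 atomics. This is a model under assumptions, not the driver's code: the struct and function names are hypothetical, the broadcast IPI is only a comment, and the hardware flush is replaced by a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state: one generation counter, no lock. */
struct machine_sketch {
	_Atomic uint64_t tlb_gen;
};

/* Hypothetical per-VCPU state: the last generation this VCPU applied. */
struct vcpu_tlb_sketch {
	uint64_t seen_gen;
	unsigned nflushes;	/* stand-in for actual hardware flushes */
};

/* Requesting side: bump the generation; never sleeps, takes no mutex. */
static void
machine_tlb_flush(struct machine_sketch *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
	/* The real driver would now broadcast an IPI to force #VMEXITs. */
}

/*
 * VCPU loop side: compare the per-VM generation with the local copy
 * and flush lazily, only if something changed since the last sync.
 */
static bool
vcpu_sync_tlb(struct machine_sketch *m, struct vcpu_tlb_sketch *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (gen == v->seen_gen)
		return false;
	v->nflushes++;
	v->seen_gen = gen;
	return true;
}
```

Several flush requests issued before a VCPU re-enters its loop collapse into a single flush on that VCPU, and a VCPU that never runs pays nothing until it does.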
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM
 * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory.
 * Hide RDTSCP.
 * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.
 * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize:
 * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely).
 * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS.
 * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM.
Kernel driver:
 * Don't take an extra (unneeded) reference to the UAO.
 * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited.
 * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too.
 * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does).
 * Provide a hypervisor signature in CPUID, and hide SVM.
 * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.
 * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later.
Libnvmm:
 * Fix the MMU translation of large pages, we need to add the lower bits too.
 * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility.
 * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance.
 * Decode MOVZX.
With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.55	10-Dec-2019	ad	pg->phys_addr > VM_PAGE_TO_PHYS(pg)
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as neededi lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place: it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM.
Kernel driver:
* Don't take an extra (unneeded) reference to the UAO.
* Provide npc for HLT. I'm not really happy with it right now, will likely be revisited.
* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too.
* Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does).
* Provide a hypervisor signature in CPUID, and hide SVM.
* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.
* If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later.
Libnvmm:
* Fix the MMU translation of large pages, we need to add the lower bits too.
* Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility.
* Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance.
* Decode MOVZX.
With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.54	20-Nov-2019	maxv	Hide XSAVES-specific stuff and the masked extended states.
Revision tags: phil-wifi-20191119
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API.
- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use.
- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open().
- Rename the following things for consistency:
    nvmm_exit* -> nvmm_vcpu_exit*
    nvmm_event* -> nvmm_vcpu_event*
    NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
    NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
    NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
  Delete NVMM_EVENT_INTERRUPT_SW, unused already.
- Slightly reorganize the MI/MD definitions, for internal clarity.
- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons.
- Change the types of several variables:
    event.type: enum -> u_int
    event.vector: uint64_t -> uint8_t
    exit.u.*msr.msr: uint64_t -> uint32_t
    exit.u.io.type: enum -> bool
    exit.u.io.seg: int -> int8_t
    cap.arch.mxcsr_mask: uint64_t -> uint32_t
    cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed.
- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs.
- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM.
- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. A uint32_t is added in the structure:
    uint32_t mask:1;
    uint32_t exit:1;
    uint32_t rsvd:30;
  The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
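As a small illustration of how the two selector bits from 1.51 might be interpreted, here is a sketch in C. Only the three bit-field members come from the log entry; the struct name, the leaf_action enum and leaf_action_of() are hypothetical. The log does not say what happens when both bits are set or when reserved bits are nonzero, so treating those cases as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout quoted in the 1.51 entry; the struct name is hypothetical. */
struct cpuid_conf_bits {
	uint32_t mask:1;	/* mask the leaf */
	uint32_t exit:1;	/* trigger a VMEXIT on the leaf */
	uint32_t rsvd:30;	/* reserved */
};

/* Holds with common compilers and ABIs; bit-field layout is implementation-defined. */
static_assert(sizeof(struct cpuid_conf_bits) == sizeof(uint32_t),
    "the three fields are expected to pack into one uint32_t");

/* Hypothetical classification of the requested behavior. */
enum leaf_action { LEAF_DEFAULT, LEAF_MASK, LEAF_EXIT, LEAF_INVALID };

static enum leaf_action
leaf_action_of(struct cpuid_conf_bits b)
{
	if (b.rsvd != 0)
		return LEAF_INVALID;	/* assumption: reserved bits must be zero */
	if (b.mask && b.exit)
		return LEAF_INVALID;	/* assumption: the log does not cover this case */
	if (b.mask)
		return LEAF_MASK;
	if (b.exit)
		return LEAF_EXIT;
	return LEAF_DEFAULT;		/* both zero: back to the default behavior */
}
```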
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it with the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do:
    - nvmm_callbacks_register(&cbs);
    + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

          +---------------------------------------------+
          |                    Qemu                     |
          +---------------------------------------------+
               |                                   ^
               | (0) nvmm_vcpu_getstate            | (6) Done
               |                                   |
               V                                   |
             +---------------------------------------+
             |                libnvmm                |
             +---------------------------------------+
                  |   ^           |               ^
        (1) State |   | (2) No    | (3) Ioctl:    | (5) Ok, state
          cached? |   |           | "please cache | fetched
                  |   |           |  the state"   |
                  V   |           |               |
               +-----------+      |               |
               | Comm Page |------+---------------+
               +-----------+      |
                        ^         |
           (4) "Alright |         V
                babe"   |     +--------+
                        +-----| Kernel |
                              +--------+

The main changes in behavior are:
- nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls.
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g. on a 4-CPU Intel with 4 VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
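The interplay of the three flags in 1.43 can be modelled with a short C sketch. It is a toy model under stated assumptions, not libnvmm or the kernel driver: comm_sketch, the ST_* bits and the sketch_* functions are hypothetical names, the ioctl is simulated by a function that counts its calls, and the model simply drops the cache after a run, whereas the log says the real kernel may speculatively re-cache some states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state bits; the real state flags are not reproduced here. */
#define ST_GPRS	0x1u
#define ST_SEGS	0x2u
#define ST_CRS	0x4u

/* Toy model of the comm page flags described in the 1.43 entry. */
struct comm_sketch {
	uint64_t wanted;		/* states the kernel must fetch on ioctl */
	uint64_t cached;		/* states currently valid in the comm page */
	uint64_t commit;		/* states the kernel must set in vcpu_run */
	unsigned getstate_ioctls;	/* counts the simulated syscalls */
};

/* Stand-in for the kernel side of the getstate ioctl (steps 3 and 4). */
static void
kernel_getstate_ioctl(struct comm_sketch *c)
{
	c->getstate_ioctls++;
	c->cached |= c->wanted;
	c->wanted = 0;
}

/* Userland getstate: no syscall when everything requested is cached. */
static void
sketch_getstate(struct comm_sketch *c, uint64_t flags)
{
	assert(flags != 0);
	if ((c->cached & flags) == flags)
		return;			/* direct 1->5 path */
	c->wanted = flags & ~c->cached;
	kernel_getstate_ioctl(c);
}

/* Userland setstate: never a syscall, only marks the state for commit. */
static void
sketch_setstate(struct comm_sketch *c, uint64_t flags)
{
	c->cached |= flags;
	c->commit |= flags;
}

/* Run: the kernel applies the pending commit; returns what it applied. */
static uint64_t
sketch_run(struct comm_sketch *c)
{
	uint64_t committed = c->commit;

	c->commit = 0;
	c->cached = 0;	/* assumption: the guest ran, so the cache is stale */
	return committed;
}
```

In this model, a getstate that follows a setstate of the same state is served from the comm page, which is the kind of saving the log entry describes.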
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.53	28-Oct-2019	maxv	A few changes: - Use smaller types in struct nvmm_capability. - Use smaller type for nvmm_io.port. - Switch exitstate to a compacted structure.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
An uint32_t is added in the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The two first bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g., on a 4-CPU Intel with 4 VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
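The flag logic described in revision 1.43 can be sketched as a small userland model. This is only an illustration under stated assumptions: `struct comm_page`, the four-slot state array, the `nsyscalls` counter and the choice to invalidate the cache after a run are all hypothetical and do not reflect the real NVMM layout or behavior.

```c
#include <assert.h>
#include <stdint.h>

#define NSTATES 4	/* hypothetical number of state classes */

/* Hypothetical model of the comm page; not the real NVMM structure. */
struct comm_page {
	uint64_t state[NSTATES];	/* one slot per state class */
	uint64_t state_wanted;		/* states the kernel must fetch */
	uint64_t state_cached;		/* states valid in the page */
	uint64_t state_commit;		/* states to install in vcpu_run */
};

static unsigned nsyscalls;		/* counts simulated ioctls */
static uint64_t kernel_state[NSTATES];	/* stands in for the VMCB */

/* Simulated ioctl: the kernel copies the wanted states into the page. */
static void
kernel_cache_states(struct comm_page *cp)
{
	nsyscalls++;
	for (unsigned i = 0; i < NSTATES; i++) {
		if (cp->state_wanted & (1ULL << i))
			cp->state[i] = kernel_state[i];
	}
	cp->state_cached |= cp->state_wanted;
	cp->state_wanted = 0;
}

/* getstate: no syscall if the state is already cached (path 1->5). */
static uint64_t
vcpu_getstate(struct comm_page *cp, unsigned idx)
{
	assert(idx < NSTATES);
	if (!(cp->state_cached & (1ULL << idx))) {
		cp->state_wanted = 1ULL << idx;
		kernel_cache_states(cp);
	}
	return cp->state[idx];
}

/* setstate: never a syscall, only marks the state for commit. */
static void
vcpu_setstate(struct comm_page *cp, unsigned idx, uint64_t val)
{
	assert(idx < NSTATES);
	cp->state[idx] = val;
	cp->state_cached |= 1ULL << idx;
	cp->state_commit |= 1ULL << idx;
}

/* run: one syscall that first commits the pending states. */
static void
vcpu_run(struct comm_page *cp)
{
	nsyscalls++;
	for (unsigned i = 0; i < NSTATES; i++) {
		if (cp->state_commit & (1ULL << i))
			kernel_state[i] = cp->state[i];
	}
	cp->state_commit = 0;
	/* Assumption: the guest ran, so the cached copy is now stale. */
	cp->state_cached = 0;
}
```

In this model a second `vcpu_getstate()` on the same state costs no simulated syscall, and `vcpu_setstate()` costs none until the next `vcpu_run()`.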
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
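The difference between the two filtering styles mentioned in revision 1.33 can be illustrated as follows. This is a sketch, not the driver's code: the `FEAT_*` names, the two masks and both helper functions are hypothetical; only the bit positions (CPUID.01H:ECX bit 0 for SSE3, bit 26 for XSAVE, bit 28 for AVX) come from the x86 architecture.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.01H:ECX feature bits (positions per the x86 architecture). */
#define FEAT_SSE3	(1u << 0)
#define FEAT_XSAVE	(1u << 26)
#define FEAT_AVX	(1u << 28)

/* Hypothetical masks, chosen for illustration only. */
#define CPUID_ALLOWED	(FEAT_SSE3 | FEAT_XSAVE)
#define CPUID_DENIED	(FEAT_AVX)

/* AVX must never leak into the allowlist. */
static_assert((CPUID_ALLOWED & FEAT_AVX) == 0, "AVX must stay hidden");

/* Allowlist: only bits we explicitly know about reach the guest. */
static uint32_t
filter_allow(uint32_t host_ecx)
{
	return host_ecx & CPUID_ALLOWED;
}

/* Denylist: anything not explicitly forbidden reaches the guest. */
static uint32_t
filter_deny(uint32_t host_ecx)
{
	return host_ecx & ~CPUID_DENIED;
}
```

A feature bit introduced by a later CPU passes through `filter_deny()` unnoticed but is dropped by `filter_allow()`, which is the forward-compatibility argument made in the commit message.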
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as needed, lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
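The generation-number scheme from revision 1.29 can be sketched with C11 atomics. This is a simplified model under assumptions: `struct machine`, `struct vcpu` and both functions are hypothetical names, the broadcast IPI is only mentioned in a comment, and the Intel-specific kcpuset tracking is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM and per-VCPU bookkeeping. */
struct machine {
	_Atomic uint64_t tlb_gen;	/* bumped on every flush request */
};

struct vcpu {
	uint64_t tlb_gen_seen;		/* last generation this VCPU applied */
	unsigned nflushes;		/* flushes actually performed */
};

/* Request a flush: lock-less, never sleeps. */
static void
machine_tlb_flush(struct machine *m)
{
	assert(m);
	atomic_fetch_add(&m->tlb_gen, 1);
	/* The real driver would now send a broadcast IPI to force #VMEXITs. */
}

/*
 * Called from the VCPU loop before re-entering the guest. Returns true
 * if a flush was applied. Several pending requests collapse into one.
 */
static bool
vcpu_sync_tlb(struct machine *m, struct vcpu *v)
{
	assert(m && v);
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen_seen == gen)
		return false;
	v->tlb_gen_seen = gen;
	v->nflushes++;		/* stands in for the hardware flush command */
	return true;
}
```

Because the requester only increments a counter, it needs no VCPU mutex; each VCPU catches up on its own the next time it passes through its loop.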
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM.
Kernel driver:
* Don't take an extra (unneeded) reference to the UAO.
* Provide npc for HLT. I'm not really happy with it right now, will likely be revisited.
* Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too.
* Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does).
* Provide a hypervisor signature in CPUID, and hide SVM.
* Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC.
* If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later.
Libnvmm:
* Fix the MMU translation of large pages, we need to add the lower bits too.
* Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility.
* Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance.
* Decode MOVZX.
With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.52	27-Oct-2019	maxv	Add a new VCPU conf option, that allows userland to request VMEXITs after a TPR change. This is supported on all Intel CPUs, and not-too-old AMD CPUs. The reason for wanting this option is that certain OSes (like Win10 64bit) manage interrupt priority in hardware via CR8 directly, and for these OSes, the emulator may want to sync its internal TPR state on each change. Add two new fields in cap.arch, to report the conf capabilities. Report TPR only on Intel for now, not AMD, because I don't have a recent AMD CPU on which to test.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API.
- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use.
- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open().
- Rename the following things for consistency:
    nvmm_exit* -> nvmm_vcpu_exit*
    nvmm_event* -> nvmm_vcpu_event*
    NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
    NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
    NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
  Delete NVMM_EVENT_INTERRUPT_SW, unused already.
- Slightly reorganize the MI/MD definitions, for internal clarity.
- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons.
- Change the types of several variables:
    event.type: enum -> u_int
    event.vector: uint64_t -> uint8_t
    exit.u.*msr.msr: uint64_t -> uint32_t
    exit.u.io.type: enum -> bool
    exit.u.io.seg: int -> int8_t
    cap.arch.mxcsr_mask: uint64_t -> uint32_t
    cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t
- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed.
- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs.
- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM.
- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. A uint32_t is added in the structure:
    uint32_t mask:1;
    uint32_t exit:1;
    uint32_t rsvd:30;
  The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
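How the two selector bits from revision 1.51 might be interpreted is sketched below. Only the three bitfields come from the commit message; `struct cpuid_conf_bits`, `enum leaf_action` and `leaf_action()` are hypothetical, and treating "both bits set" as invalid is an assumption the log does not state.

```c
#include <assert.h>
#include <stdint.h>

/* The bitfields named in the commit message, in a hypothetical wrapper. */
struct cpuid_conf_bits {
	uint32_t mask:1;	/* filter the leaf with caller-supplied masks */
	uint32_t exit:1;	/* trigger a VMEXIT when the leaf is queried */
	uint32_t rsvd:30;	/* reserved */
};

enum leaf_action {
	LEAF_DEFAULT,	/* both bits zero: reset to default behavior */
	LEAF_MASK,
	LEAF_EXIT,
	LEAF_INVALID	/* assumption: both bits set is rejected */
};

static enum leaf_action
leaf_action(const struct cpuid_conf_bits *c)
{
	assert(c);
	if (c->mask && c->exit)
		return LEAF_INVALID;
	if (c->exit)
		return LEAF_EXIT;
	if (c->mask)
		return LEAF_MASK;
	return LEAF_DEFAULT;
}
```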
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do:
    - nvmm_callbacks_register(&cbs);
    + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.51	23-Oct-2019	maxv	Miscellaneous changes in NVMM, to address several inconsistencies and issues in the libnvmm API. - Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in libnvmm. Introduce NVMM_USER_VERSION, for future use. - In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to avoid sharing the VMs with the children if the process forks. In the NVMM driver, force O_CLOEXEC on open(). - Rename the following things for consistency: nvmm_exit* -> nvmm_vcpu_exit* nvmm_event* -> nvmm_vcpu_event* NVMM_EXIT_* -> NVMM_VCPU_EXIT_* NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP Delete NVMM_EVENT_INTERRUPT_SW, unused already. - Slightly reorganize the MI/MD definitions, for internal clarity. - Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide separate u.rdmsr and u.wrmsr fields. This is more consistent with the other exit reasons. - Change the types of several variables: event.type enum -> u_int event.vector uint64_t -> uint8_t exit.u.*msr.msr: uint64_t -> uint32_t exit.u.io.type: enum -> bool exit.u.io.seg: int -> int8_t cap.arch.mxcsr_mask: uint64_t -> uint32_t cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t - Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we already intercept 'monitor' so it is never armed. - Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn(). The 'npc' field wasn't getting filled properly during certain VMEXITs. - Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(), but as its name indicates, the configuration is per-VCPU and not per-VM. Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID. This becomes per-VCPU, which makes more sense than per-VM. - Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on specific leaves. Until now we could only mask the leaves. 
A uint32_t is added to the structure: uint32_t mask:1; uint32_t exit:1; uint32_t rsvd:30; The first two bits select the desired behavior on the leaf. Specifying zero on both resets the leaf to the default behavior. The new NVMM_VCPU_EXIT_CPUID exit reason is added.
# 1.50	12-Oct-2019	maxv	Rewrite the FPU code on x86. This greatly simplifies the logic and removes the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on port-amd64 a week ago. Bump the kernel version to 9.99.16.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; 1.46.4; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall. [Diagram: Qemu, libnvmm, comm page, kernel] (0) Qemu calls nvmm_vcpu_getstate on libnvmm; (1) libnvmm checks the comm page: "State cached?"; (2) "No"; (3) libnvmm issues an ioctl to the kernel: "please cache the state"; (4) the kernel answers "Alright babe" and writes the state to the comm page; (5) "Ok, state fetched": libnvmm reads it from the comm page; (6) "Done": libnvmm returns to Qemu. The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several million syscalls. E.g. on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
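The wanted/cached/commit bookkeeping from the 1.43 entry can be sketched in plain userland C. Everything here is a simplified model under stated assumptions: the struct layout, the `STATE_*` values, the `sketch_*` function names and the fake ioctl are hypothetical, and invalidating the cache after a run is a simplification (the commit message says the kernel may instead re-cache some states speculatively).

```c
#include <assert.h>
#include <stdint.h>

#define STATE_GPRS 0x1u   /* hypothetical state-set bits */
#define STATE_SEGS 0x2u

/* Simplified model of the three flags kept in the comm page. */
struct comm_page_sketch {
	uint64_t state_wanted;   /* states the kernel must fetch on ioctl */
	uint64_t state_cached;   /* states currently valid in the comm page */
	uint64_t state_commit;   /* states the kernel must set in vcpu_run */
};

static unsigned state_ioctls;    /* counts simulated state syscalls */

/* Stand-in for the ioctl: the "kernel" caches what was asked for. */
static void
fake_ioctl_getstate(struct comm_page_sketch *comm)
{
	state_ioctls++;
	comm->state_cached |= comm->state_wanted;
	comm->state_wanted = 0;
}

/* getstate: no syscall if everything requested is already cached. */
static void
sketch_getstate(struct comm_page_sketch *comm, uint64_t flags)
{
	if ((comm->state_cached & flags) == flags)
		return;                          /* the direct 1->5 path */
	comm->state_wanted = flags & ~comm->state_cached;
	fake_ioctl_getstate(comm);
}

/* setstate: never a syscall, just mark the state for a later commit. */
static void
sketch_setstate(struct comm_page_sketch *comm, uint64_t flags)
{
	comm->state_cached |= flags;
	comm->state_commit |= flags;
}

/* run: the "kernel" consumes the commit set before entering the guest. */
static void
sketch_run(struct comm_page_sketch *comm)
{
	comm->state_commit = 0;
	comm->state_cached = 0;   /* assumption: guest ran, cache is stale */
}
```

In this model a second `sketch_getstate()` for the same state costs no simulated syscall, which is the saving the commit message measures.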
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
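The difference between allowing and disallowing CPUID bits, as described in the 1.33 entry, comes down to which mask is applied to the host value. The sketch below is illustrative only: the allowlist contents are not the driver's actual list, `ECX_FUTURE` is an arbitrary stand-in for a feature bit the driver has no opinion on, and the helper names are hypothetical. The SSE3, SSSE3 and AVX bit positions are those of CPUID leaf 1, ECX.

```c
#include <assert.h>
#include <stdint.h>

#define ECX_SSE3    (1u << 0)
#define ECX_SSSE3   (1u << 9)
#define ECX_AVX     (1u << 28)
#define ECX_FUTURE  (1u << 30)  /* stand-in for a bit unknown to the driver */

/* Illustrative lists, not the ones used by NVMM. */
#define ECX_ALLOWED (ECX_SSE3 | ECX_SSSE3)
#define ECX_DENIED  (ECX_AVX)

/* Allowlist: anything not explicitly permitted is hidden from the guest. */
static uint32_t
filter_allow(uint32_t host_ecx)
{
	return host_ecx & ECX_ALLOWED;
}

/* Denylist: anything not explicitly forbidden leaks through. */
static uint32_t
filter_deny(uint32_t host_ecx)
{
	return host_ecx & ~ECX_DENIED;
}
```

With a host value that has `ECX_FUTURE` set, the denylist variant exposes it to the guest while the allowlist variant does not, which is the forward-compatibility argument made in the commit message.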
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes lazily, as needed. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
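The generation-number scheme from the 1.29 entry can be modelled in a few lines of C11. This is a sketch under assumptions, not the kernel code: the struct and function names are hypothetical, the broadcast IPI is only a comment, and the actual flush is replaced by a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct vm_sketch {
	_Atomic uint64_t tlb_gen;   /* per-VM generation, bumped on each flush */
};

struct vcpu_sketch {
	uint64_t tlb_gen_seen;      /* per-VCPU copy of the generation */
	unsigned flushes;           /* counts flushes actually applied */
};

/* Requester side: lock-less, never sleeps. */
static void
vm_request_tlb_flush(struct vm_sketch *vm)
{
	atomic_fetch_add(&vm->tlb_gen, 1);
	/* The real driver would now send a broadcast IPI to force a #VMEXIT. */
}

/* VCPU loop side: apply a flush only if the generation moved. */
static bool
vcpu_sync_tlb(struct vm_sketch *vm, struct vcpu_sketch *vcpu)
{
	uint64_t gen = atomic_load(&vm->tlb_gen);

	if (gen == vcpu->tlb_gen_seen)
		return false;           /* nothing new, no flush */
	vcpu->tlb_gen_seen = gen;
	vcpu->flushes++;            /* stand-in for the actual TLB flush */
	return true;
}
```

Several flush requests issued before a VCPU re-enters its loop collapse into a single flush on that VCPU, which is why the scheme can be lazy without losing requests.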
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL; doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here, remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way, my check was wrong in the first place: it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
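The large-page fix listed under "Libnvmm" in the 1.10 entry ("we need to add the lower bits too") can be illustrated for a 2 MiB page. The function names and constants below are hypothetical and do not come from libnvmm; the sketch only shows why the offset within the large page has to be carried over from the guest virtual address.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_2M_SIZE   (1ull << 21)
#define LARGE_2M_OFFSET (LARGE_2M_SIZE - 1)      /* low 21 bits of the GVA */
#define LARGE_2M_FRAME  0x000FFFFFFFE00000ull    /* bits 51:21 of the entry */

/* Incomplete translation: returns only the base of the 2 MiB frame. */
static uint64_t
translate_2m_base_only(uint64_t pde)
{
	return pde & LARGE_2M_FRAME;
}

/* Complete translation: frame base plus the offset inside the large page. */
static uint64_t
translate_2m(uint64_t pde, uint64_t gva)
{
	return (pde & LARGE_2M_FRAME) | (gva & LARGE_2M_OFFSET);
}
```

For an entry mapping the frame at 0x40000000 and a guest virtual address whose low 21 bits are 0x123456, the complete translation yields 0x40123456, while the base-only variant returns 0x40000000 for every address in the page.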
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.49	04-Oct-2019	maxv	Switch to the new PTE naming.
# 1.48	04-Oct-2019	maxv	Fix definition for MWAIT. It should be bit 11, not 12; 12 is the armed version.
# 1.47	04-Oct-2019	maxv	Add definitions for RDPRU, MCOMMIT, GMET and VTE.
Revision tags: netbsd-9-base phil-wifi-20190609
# 1.46	11-May-2019	maxv	branches: 1.46.2; Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do:
    - nvmm_callbacks_register(&cbs);
    + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);
This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags:
 - "wanted": the states the kernel must get/set when requested via ioctls
 - "cached": the states that are in the comm page
 - "commit": the states the kernel must set in vcpu_run
The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall:

    +---------------------------------------------+
    |                    Qemu                     |
    +---------------------------------------------+
          |                                 ^
          | (0) nvmm_vcpu_getstate          | (6) Done
          |                                 |
          V                                 |
       +---------------------------------------+
       |                libnvmm                |
       +---------------------------------------+
            |   ^           |               ^
  (1) State |   | (2) No    | (3) Ioctl:    | (5) Ok, state
    cached? |   |           | "please cache |     fetched
            |   |           |  the state"   |
            V   |           |               |
         +-----------+      |               |
         | Comm Page |------+---------------+
         +-----------+      |
              ^             |
 (4) "Alright |             V
     babe"    |     +--------+
              +-----| Kernel |
                    +--------+

The main changes in behavior are:
 - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly
 - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page
 - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate()
In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls.
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
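The wanted/cached/commit protocol described in revision 1.43 can be sketched as a small userland simulation. This is only an illustration of the idea, assuming three state groups and a fake kernel that counts syscalls. The struct layout, the `ST_*` flags and every function name below are hypothetical, and the real driver speculatively re-caches some states after a run, whereas this sketch simply invalidates the cache.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state groups; the real NVMM flags and layout differ. */
#define ST_GPRS 0x1u
#define ST_SEGS 0x2u
#define ST_CRS  0x4u

struct vcpu_state {
	uint64_t gprs[16];
	uint64_t segs[8];
	uint64_t crs[4];
};

struct comm_page {
	uint64_t state_wanted;  /* states the kernel must fetch on ioctl */
	uint64_t state_cached;  /* states currently valid in the page */
	uint64_t state_commit;  /* states the kernel must set in vcpu_run */
	struct vcpu_state state;
};

/* Stand-in for the kernel side: guest state plus a syscall counter. */
struct fake_kernel {
	struct vcpu_state hw;
	unsigned nsyscalls;
};

static void
copy_groups(struct vcpu_state *dst, const struct vcpu_state *src,
    uint64_t flags)
{
	if (flags & ST_GPRS)
		memcpy(dst->gprs, src->gprs, sizeof(dst->gprs));
	if (flags & ST_SEGS)
		memcpy(dst->segs, src->segs, sizeof(dst->segs));
	if (flags & ST_CRS)
		memcpy(dst->crs, src->crs, sizeof(dst->crs));
}

/* Steps (3) and (4): the ioctl that asks the kernel to cache states. */
void
kern_getstate(struct fake_kernel *k, struct comm_page *c)
{
	k->nsyscalls++;
	copy_groups(&c->state, &k->hw, c->state_wanted);
	c->state_cached |= c->state_wanted;
	c->state_wanted = 0;
}

/* Steps (1), (2) and (5): only enter the kernel for missing states. */
void
lib_getstate(struct fake_kernel *k, struct comm_page *c,
    struct vcpu_state *out, uint64_t flags)
{
	uint64_t missing = flags & ~c->state_cached;

	if (missing != 0) {
		c->state_wanted = missing;
		kern_getstate(k, c);
	}
	copy_groups(out, &c->state, flags);
}

/* No syscall: stage the state in the page and mark it for commit. */
void
lib_setstate(struct comm_page *c, const struct vcpu_state *in,
    uint64_t flags)
{
	copy_groups(&c->state, in, flags);
	c->state_cached |= flags;
	c->state_commit |= flags;
}

/* The run call installs staged states before entering the guest. */
void
kern_vcpu_run(struct fake_kernel *k, struct comm_page *c)
{
	k->nsyscalls++;
	copy_groups(&k->hw, &c->state, c->state_commit);
	c->state_commit = 0;
	/* The guest ran, so the cached copy is stale (simplification). */
	c->state_cached = 0;
}
```

In this sketch a second `lib_getstate()` for the same group costs no syscall, and `lib_setstate()` never costs one: the staged state only reaches the fake kernel on the next `kern_vcpu_run()`.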
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
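The difference between allowing and disallowing CPUID bits, as described in revision 1.33, can be sketched as follows. The bit positions are those of CPUID leaf 1, ECX, but the `EX_*` names and both helper functions are hypothetical and do not come from the NVMM sources. `EX_RDRAND` merely plays the role of a feature that the filter's author did not know about.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.01H:ECX feature bits; names are illustrative only. */
#define EX_SSE3   (1u << 0)
#define EX_SSSE3  (1u << 9)
#define EX_AVX    (1u << 28)
#define EX_RDRAND (1u << 30)

/* Allowlist: the guest sees only the bits that were explicitly vetted. */
uint32_t
filter_allow(uint32_t host, uint32_t allowed)
{
	return host & allowed;
}

/* Denylist: any bit missing from the list leaks through to the guest. */
uint32_t
filter_deny(uint32_t host, uint32_t denied)
{
	return host & ~denied;
}
```

On a host that advertises a feature neither list mentions, the allowlist hides it while the denylist passes it through, which is why the allowlist form stays safe on CPUs released after the list was written.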
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes lazily, as needed. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, i.e. there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
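The generation-number scheme from revision 1.29 can be sketched with C11 atomics. This is a minimal model under stated assumptions: the broadcast IPI and the hardware flush are replaced by comments and a counter, the Intel kcpuset bookkeeping is left out, and all struct and function names are hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-VM state: a generation bumped on every flush request. */
struct machine {
	_Atomic uint64_t tlb_gen;
};

/* Per-VCPU state: the last generation this VCPU caught up with. */
struct vcpu {
	uint64_t tlb_gen_seen;
	unsigned nflushes;   /* stand-in for actual hardware flushes */
};

/*
 * Request a flush without taking any lock. A real implementation
 * would follow the increment with a broadcast IPI so that running
 * VCPUs take a #VMEXIT and notice the new generation.
 */
void
machine_tlb_flush(struct machine *m)
{
	atomic_fetch_add(&m->tlb_gen, 1);
}

/*
 * Called from the VCPU loop before entering the guest. Flushes at
 * most once, however many requests were made since the last sync.
 */
bool
vcpu_sync_tlb(struct machine *m, struct vcpu *v)
{
	uint64_t gen = atomic_load(&m->tlb_gen);

	if (v->tlb_gen_seen == gen)
		return false;

	v->nflushes++;
	v->tlb_gen_seen = gen;
	return true;
}
```

Because the requester only increments a counter and each VCPU compares it against its own copy, neither side sleeps or takes a mutex, and several requests arriving between two guest entries collapse into a single flush.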
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.46	11-May-2019	maxv	Rework the machine configuration interface. Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and <MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now per-machine, and the emulators should now do: - nvmm_callbacks_register(&cbs); + nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs); This provides more granularity, for example if the process runs two VMs and wants different callbacks for each.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectionnal "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, if the state is cached we do a direct 1->5 with no syscall: +---------------------------------------------+ \| Qemu \| +---------------------------------------------+ \| ^ \| (0) nvmm_vcpu_getstate \| (6) Done \| \| V \| +---------------------------------------+ \| libnvmm \| +---------------------------------------+ \| ^ \| ^ (1) State \| \| (2) No \| (3) Ioctl: \| (5) Ok, state cached? \| \| \| "please cache \| fetched \| \| \| the state" \| V \| \| \| +-----------+ \| \| \| Comm Page \|------+---------------+ +-----------+ \| ^ \| (4) "Alright \| V babe" \| +--------+ +-----\| Kernel \| +--------+ The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that the future nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than an union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU Loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes as neededi lazily. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel, contrary to AMD Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.
# 1.10	06-Jan-2019	maxv	Improvements and fixes in NVMM. Kernel driver: * Don't take an extra (unneeded) reference to the UAO. * Provide npc for HLT. I'm not really happy with it right now, will likely be revisited. * Add the INT_SHADOW, INT_WINDOW_EXIT and NMI_WINDOW_EXIT states. Provide them in the exitstate too. * Don't take the TPR into account when processing INTs. The virtualizer can do that itself (Qemu already does). * Provide a hypervisor signature in CPUID, and hide SVM. * Ignore certain MSRs. One special case is MSR_NB_CFG in which we set NB_CFG_INITAPICCPUIDLO. Allow reads of MSR_TSC. * If the LWP has pending signals or softints, leave, rather than waiting for a rescheduling to happen later. This reduces interrupt processing time in the guest (Qemu sends a signal to the thread, and now we leave right away). This could be improved even more by sending an actual IPI to the CPU, but I'll see later. Libnvmm: * Fix the MMU translation of large pages, we need to add the lower bits too. * Change the IO and Mem structures to take a pointer rather than a static array. This provides more flexibility. * Batch together the str+rep IO transactions. We do one big memory read/write, and then send the IO commands to the hypervisor all at once. This considerably increases performance. * Decode MOVZX. With these changes in place, Qemu+NVMM works. I can install NetBSD 8.0 in a VM with multiple VCPUs, connect to the network, etc.
# 1.9	03-Jan-2019	maxv	Fix another gross copy-pasto.
# 1.8	02-Jan-2019	maxv	When there's no DecodeAssist in hardware, decode manually in software. This is needed on certain AMD CPUs (like mine): the segment base of OUTS can be overridden, and it is wrong to just assume DS. We fetch the instruction and look at the prefixes if any to determine the correct segment.
Revision tags: pgoyette-compat-1226
# 1.7	13-Dec-2018	maxv	Don't forget to advance the RIP after an XSETBV emulation.
Revision tags: pgoyette-compat-1126
# 1.6	25-Nov-2018	maxv	branches: 1.6.2; Add RFLAGS in the exitstate.
# 1.5	22-Nov-2018	maxv	Add missing pmap_update after pmap_kenter_pa, noted by Kamil.
# 1.4	19-Nov-2018	maxv	Rename one constant, for clarity.
# 1.3	14-Nov-2018	maxv	Take RAX from the VMCB and not the VCPU state, the latter is not synchronized and contains old values.
# 1.2	10-Nov-2018	maxv	Remove unused cpu_msr.h includes.
# 1.1	07-Nov-2018	maxv	Add NVMM - for NetBSD Virtual Machine Monitor -, a kernel driver that provides support for hardware-accelerated virtualization on NetBSD. It is made of an MI frontend, to which MD backends can be plugged. One MD backend is implemented, x86-SVM, for x86 AMD CPUs. We install /usr/include/dev/nvmm/nvmm.h /usr/include/dev/nvmm/nvmm_ioctl.h /usr/include/dev/nvmm/{arch}/nvmm_{arch}.h And the kernel module. For now, the only architecture where we do that is amd64 (arch=x86). NVMM is not enabled by default in amd64-GENERIC, but is instead easily modloadable. Sent to tech-kern@ a month ago. Validated with kASan, and optimized with tprof.
# 1.45	01-May-2019	maxv	Use the comm page to inject events, rather than ioctls, and commit them in vcpu_run. This saves a few syscalls and copyins. For example on Windows 10, moving the mouse from the left to right sides of the screen generates ~500 events, which now don't result in syscalls. The error handling is done in vcpu_run and it is less precise, but this doesn't matter a lot, and will be solved with future NVMM error codes.
# 1.44	29-Apr-2019	maxv	Stop taking care of the INT/NMI windows in the kernel, the emulator is supposed to do that itself.
# 1.43	28-Apr-2019	maxv	Modify the communication layer between the kernel NVMM driver and libnvmm: introduce a bidirectional "comm page", a page of memory shared between the kernel and userland, and used to transfer data in and out in a more performant manner than ioctls. The comm page contains the VCPU state, plus three flags: - "wanted": the states the kernel must get/set when requested via ioctls - "cached": the states that are in the comm page - "commit": the states the kernel must set in vcpu_run The idea is to avoid performing expensive syscalls, by using the VCPU state cached, either explicitly or speculatively, in the comm page. For example, when Qemu calls nvmm_vcpu_getstate (0), libnvmm first checks whether the state is cached in the comm page (1). If it is not (2), libnvmm issues an ioctl asking the kernel to cache the state (3), and the kernel writes it to the comm page (4). libnvmm then fetches the state from the comm page (5) and returns to Qemu (6). If the state is already cached, we go directly from (1) to (5) with no syscall. The main changes in behavior are: - nvmm_vcpu_getstate(): won't emit a syscall if the state is already cached in the comm page, will just fetch from the comm page directly - nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache the wanted state in the comm page - nvmm_vcpu_run(): will commit the to-be-set state in the comm page, as previously requested by nvmm_vcpu_setstate() In addition to this, the kernel NVMM driver is changed to speculatively cache certain states known to be of interest, so that future nvmm_vcpu_getstate() calls made by libnvmm or the emulator use the comm page rather than expensive syscalls. 
For example, if an I/O VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS, and now the kernel caches all of that in the comm page before returning to userland. Overall, in a normal run of Windows 10, this saves several millions of syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO goes from taking 1min35 to taking 1min16. The libnvmm API is not changed, but the ABI is. If we changed the API it would be possible to save expensive memcpys on libnvmm's side. This will be avoided in a future version. The comm page can also be extended to implement future services.
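The wanted/cached/commit protocol from 1.43 can be modelled in plain C. This is a userland simulation under stated assumptions, not the NVMM code: all names here (`struct comm_page`, `lib_getstate`, `lib_setstate`, `kernel_getstate_ioctl`, `kernel_vcpu_run`, the `ST_*` bits) are hypothetical, the "kernel" is an ordinary struct, and a counter stands in for syscalls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state-group bits (the real NVMM flags differ). */
#define ST_GPRS 0x1u
#define ST_SEGS 0x2u

struct vcpu_state { uint64_t gprs[16]; uint64_t segs[6]; };

/* Hypothetical comm page: the shared state plus the three flags. */
struct comm_page {
    uint64_t wanted;  /* groups the kernel must fetch on ioctl */
    uint64_t cached;  /* groups currently valid in the page */
    uint64_t commit;  /* groups the kernel must apply in vcpu_run */
    struct vcpu_state state;
};

/* Stand-in for the kernel side; 'syscalls' counts kernel entries. */
struct kernel_vcpu { struct vcpu_state hw; unsigned syscalls; };

void copy_groups(struct vcpu_state *dst, const struct vcpu_state *src,
    uint64_t flags)
{
    if (flags & ST_GPRS) memcpy(dst->gprs, src->gprs, sizeof(dst->gprs));
    if (flags & ST_SEGS) memcpy(dst->segs, src->segs, sizeof(dst->segs));
}

/* Simulated ioctl: the kernel fills in the wanted groups. */
void kernel_getstate_ioctl(struct kernel_vcpu *k, struct comm_page *cp)
{
    k->syscalls++;
    copy_groups(&cp->state, &k->hw, cp->wanted);
    cp->cached |= cp->wanted;
    cp->wanted = 0;
}

/* getstate: enter the kernel only for groups that are not cached. */
void lib_getstate(struct kernel_vcpu *k, struct comm_page *cp,
    uint64_t flags, struct vcpu_state *out)
{
    assert(cp != NULL && out != NULL);
    uint64_t missing = flags & ~cp->cached;
    if (missing != 0) {
        cp->wanted = missing;
        kernel_getstate_ioctl(k, cp);
    }
    copy_groups(out, &cp->state, flags);
}

/* setstate: never enters the kernel, only marks groups for commit. */
void lib_setstate(struct comm_page *cp, uint64_t flags,
    const struct vcpu_state *in)
{
    copy_groups(&cp->state, in, flags);
    cp->cached |= flags;
    cp->commit |= flags;
}

/* vcpu_run: apply pending commits; the guest then changes the state,
 * so this sketch simply invalidates the cache afterwards. */
void kernel_vcpu_run(struct kernel_vcpu *k, struct comm_page *cp)
{
    k->syscalls++;
    copy_groups(&k->hw, &cp->state, cp->commit);
    cp->commit = 0;
    cp->cached = 0;
}
```

In this model a second `lib_getstate()` for the same groups costs no kernel entry, and a `lib_setstate()` only takes effect on the simulated hardware state at the next `kernel_vcpu_run()`. The speculative caching on VMEXIT mentioned in the commit message is left out for brevity.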
# 1.42	27-Apr-2019	maxv	Reorder the NVMM headers, to make a clear(er) distinction between MI and MD. Also use #defines for the exit reasons rather than a union. No ABI change, and no API change except 'cap->u.{}' renamed to 'cap->arch'.
# 1.41	27-Apr-2019	maxv	If guest events were being processed when a #VMEXIT occurred, reschedule the events rather than dismissing them. This can happen for instance when a guest wants to process an exception and an #NPF occurs on the guest IDT. In practice it occurs only when the host swapped out specific guest pages.
# 1.40	24-Apr-2019	maxv	Provide the hardware error code for NVMM_EXIT_INVALID, useful when debugging.
Revision tags: isaki-audio2-base
# 1.39	20-Apr-2019	maxv	Ah, take XSAVE into account in ECX too, not just in EBX. Otherwise if the guest relies only on ECX to initialize/copy the FPU state (like NetBSD does), spurious #GPs can be encountered because the bitmap is clobbered.
# 1.38	07-Apr-2019	maxv	Invert the filtering priority: now the kernel-managed cpuid leaves are overwritable by the virtualizer. This is useful to virtualizers that want to 100% control every leaf.
# 1.37	06-Apr-2019	maxv	Replace the misc[] state by a new compressed nvmm_x64_state_intr structure, which describes the interruptibility state of the guest. Add evt_pending, read-only, that allows the virtualizer to know if an event is pending.
# 1.36	03-Apr-2019	maxv	Add MSR_TSC.
# 1.35	21-Mar-2019	maxv	Make it possible for an emulator to set the protection of the guest pages. For some reason I had initially concluded that it wasn't doable; verily it is, so let's do it. The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes mmap-like protection codes.
# 1.34	14-Mar-2019	maxv	Reduce the mask of the VTPR, only the first four bits matter.
# 1.33	03-Mar-2019	maxv	Choose which CPUID bits to allow, rather than which bits to disallow. This is clearer, and also forward compatible with future CPUs. While here be more consistent when allowing the bits, and sync between nvmm-amd and nvmm-intel. Also make sure to disallow AVX, because the guest state we provide is only x86+SSE. Fixes a CentOS panic when booting on NVMM, reported by Jared McNeill, thanks.
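The difference between the two filtering policies in 1.33 can be shown on CPUID leaf 1 ECX. The bit positions below are the architectural ones, but the masks and function names (`ALLOWED_ECX`, `filter_allow`, `filter_deny`) are illustrative assumptions and do not reproduce the masks used by the driver.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits (architectural positions). */
#define CPUID_1_ECX_SSE3   (UINT32_C(1) << 0)
#define CPUID_1_ECX_SSSE3  (UINT32_C(1) << 9)
#define CPUID_1_ECX_SSE41  (UINT32_C(1) << 19)
#define CPUID_1_ECX_SSE42  (UINT32_C(1) << 20)
#define CPUID_1_ECX_AVX    (UINT32_C(1) << 28)
#define CPUID_1_ECX_F16C   (UINT32_C(1) << 29)

/* Illustrative masks only; not the driver's real lists. */
#define ALLOWED_ECX (CPUID_1_ECX_SSE3 | CPUID_1_ECX_SSSE3 | \
                     CPUID_1_ECX_SSE41 | CPUID_1_ECX_SSE42)
#define DENIED_ECX  (CPUID_1_ECX_AVX)

/* Allow-list: anything not explicitly listed is hidden. */
uint32_t filter_allow(uint32_t host_ecx)
{
    uint32_t guest_ecx = host_ecx & ALLOWED_ECX;
    assert((guest_ecx & ~ALLOWED_ECX) == 0);
    return guest_ecx;
}

/* Deny-list: anything not explicitly listed leaks through. */
uint32_t filter_deny(uint32_t host_ecx)
{
    return host_ecx & ~DENIED_ECX;
}
```

With a host that reports SSE3, AVX and F16C, the deny-list hides AVX but still exposes F16C because nobody listed it, while the allow-list exposes only SSE3. That is the forward-compatibility property the commit message refers to: features of future CPUs stay hidden until someone deliberately allows them.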
# 1.32	26-Feb-2019	maxv	Change the layout of the SEG state: - Reorder it, to match the CPU encoding. This is the universal order, also used by Qemu. Drop the seg_to_nvmm[] tables. - Compress it. This divides its size by two. - Rename some of its fields, to better match the x86 spec. Also, take S out of Type, this was a NetBSD-ism that was likely confusing to other people.
# 1.31	23-Feb-2019	maxv	Install the x86 RESET state at VCPU creation time, for convenience, so that the libnvmm users can expect a functional VCPU right away.
# 1.30	23-Feb-2019	maxv	Reorder the functions, and constify setstate. No functional change.
# 1.29	21-Feb-2019	maxv	Another locking issue in NVMM: the {svm,vmx}_tlb_flush functions take VCPU mutexes which can sleep, but their context does not allow it. Rewrite the TLB handling code to fix that. It becomes a bit complex. In short, we use a per-VM generation number, which we increase on each TLB flush, before sending a broadcast IPI to everybody. The IPIs cause a #VMEXIT of each VCPU, and each VCPU loop will synchronize the per-VM gen with a per-VCPU copy, and apply the flushes lazily as needed. The behavior differs between AMD and Intel; in short, on Intel we don't flush the hTLB (EPT cache) if a context switch of a VCPU occurs, so now, we need to maintain a kcpuset to know which VCPU's hTLBs are active on which hCPU. This creates some redundancy on Intel, ie there are cases where we flush the hTLB several times unnecessarily; but hTLB flushes are very rare, so there is no real performance regression. The thing is lock-less and non-blocking, so it solves our problem.
# 1.28	21-Feb-2019	maxv	Clarify the gTLB code a little.
# 1.27	18-Feb-2019	maxv	Ah, finally found you. Fix scheduling bug in NVMM. When processing guest page faults, we were calling uvm_fault with preemption disabled. The thing is, uvm_fault may block, and if it does, we land in sleepq_block which calls mi_switch; so we get switched away while we explicitly asked not to be. From then on things could go really wrong. Fix that by processing such faults in MI, where we have preemption enabled and are allowed to block. A KASSERT in sleepq_block (or before) would have helped.
# 1.26	16-Feb-2019	maxv	Ah no, adapt previous, on AMD RAX is in the VMCB.
# 1.25	16-Feb-2019	maxv	Improve the FPU detection: hide XSAVES because we're not allowing it, and don't set CPUID2_OSXSAVE if the guest didn't first set CR4_OSXSAVE. With these changes in place, I can boot Windows 10 on NVMM.
# 1.24	15-Feb-2019	maxv	Initialize the guest TSC to zero at VCPU creation time, and handle guest writes to MSR_TSC at run time. This is imprecise, because the hardware does not provide a way to preserve the TSC during #VMEXITs, but that's fine enough.
# 1.23	14-Feb-2019	maxv	Harmonize the handling of the CPL between AMD and Intel. AMD has a separate guest CPL field, because on AMD, the SYSCALL/SYSRET instructions do not force SS.DPL to predefined values. On Intel they do, so the CPL on Intel is just the guest's SS.DPL value. Even though technically possible on AMD, there is no sane reason for a guest kernel to set a non-three SS.DPL, doing that would mess up several common segmentation practices and wouldn't be compatible with Intel. So, force the Intel behavior on AMD, by always setting SS.DPL<=>CPL. Remove the now unused CPL field from nvmm_x64_state::misc[]. This actually increases performance on AMD: to detect interrupt windows the virtualizer has to modify some fields of misc[], and because CPL was there, we had to flush the SEG set of the VMCB cache. Now there is no flush necessary. While here remove the CPL check for XSETBV on Intel: contrary to AMD, Intel checks the CPL before the intercept, so if we receive an XSETBV VMEXIT, we are certain that it was executed at CPL=0 in the guest. By the way my check was wrong in the first place, it was reading SS.RPL instead of SS.DPL.
# 1.22	13-Feb-2019	maxv	Drop support for software interrupts. I had initially added that to cover the three event types available on AMD, but Intel has seven of them, all with weird and twisted meanings, and they require extra parameters. Software interrupts should not be used anyway.
# 1.21	13-Feb-2019	maxv	Micro optimization: the STAR/LSTAR/CSTAR/SFMASK MSRs are static, so rather than saving them on each VMENTRY, save them only once, at VCPU creation time.
# 1.20	12-Feb-2019	maxv	Optimize: the hardware does not clear the TLB flush command after a VMENTRY, so clear it ourselves, to avoid uselessly flushing the guest TLB. While here also fix the processing of EFER-induced flushes, they shouldn't be delayed.
# 1.19	04-Feb-2019	maxv	Improvements: - Guest reads/writes to PAT land in gPAT, so no need to emulate them. - When emulating EFER, don't advance the RIP if a fault occurs, and don't forget to flush the VMCB cache accordingly.
Revision tags: pgoyette-compat-20190127
# 1.18	26-Jan-2019	maxv	Remove nvmm_exit_memory.npc, useless.
# 1.17	24-Jan-2019	maxv	Optimize: change the behavior of the HLT vmexit, make it a "change in vcpu state" which occurs after the instruction executed, rather than an instruction intercept which occurs before. Disable the shadow and the intr window in kernel mode, and advance the RIP, so that the virtualizer doesn't have to do it itself. This saves two syscalls and one VMCB cache flush. Provide npc for other instruction intercepts, in case someone is interested.
# 1.16	20-Jan-2019	maxv	Improvements in NVMM * Handle the FPU differently, limit the states via the given mask rather than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by the virtualizer, to force a reload from memory. * Hide RDTSCP. * Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature. * Take ECX and not RCX on MSR instructions.
Revision tags: pgoyette-compat-20190118
# 1.15	13-Jan-2019	maxv	Reset DR7 before loading DR0-3, to prevent a fault if the host process has dbregs enabled.
# 1.14	10-Jan-2019	maxv	Optimize: * Don't save/restore the host CR2, we don't care because we're not in a #PF context (and preemption switches already handle CR2 safely). * Don't save/restore the host FS and GS, just reset them to zero after VMRUN. Note: DS and ES must be reset _before_ VMRUN, but that doesn't apply to FS and GS. * Handle FSBASE and KGSBASE outside of the VCPU loop, to avoid the cost of saving/restoring them when there's no reason to leave the loop.
# 1.13	08-Jan-2019	maxv	Optimize: don't keep a full copy of the guest state, rather take only what is needed. This avoids expensive memcpy's. Also flush the V_TPR as part of the CR-state, because there is CR8 in it.
# 1.12	07-Jan-2019	maxv	Optimize: cache the guest state entirely in the VMCB-cache, flush it on a state-by-state basis when needed.
# 1.11	06-Jan-2019	maxv	Add more VMCB fields. Also remove debugging code I mistakenly committed in the previous revision. No functional change.