History log of /netbsd-current/lib/libnvmm/libnvmm.c
Revision    Date    Author    Comments
# 1.20 06-Apr-2021 reinoud

Implement nvmm_vcpu::stop, a race-free exit from nvmm_vcpu_run() without
signals. This introduces a new kernel and userland NVMM version indicating
this support.

Patch by Kamil Rytarowski <kamil@netbsd.org> and committed on his request.

This is the missing libnvmm part I forgot to include in the original commit.
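
A minimal sketch of how an emulator might use this, assuming the
nvmm_vcpu_stop() entry point and an NVMM_VCPU_EXIT_STOPPED exit reason as
added by this change (names to be checked against the installed nvmm.h):

    #include <nvmm.h>

    /*
     * Sketch only. Another thread calls request_stop() at any time; the
     * run loop then returns from nvmm_vcpu_run() with a "stopped" exit
     * reason instead of being interrupted by a signal.
     */
    static void
    run_loop(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu)
    {
            for (;;) {
                    if (nvmm_vcpu_run(mach, vcpu) == -1)
                            break;
                    if (vcpu->exit->reason == NVMM_VCPU_EXIT_STOPPED)
                            break;  /* requested via nvmm_vcpu_stop() */
                    /* ... dispatch the other exit reasons ... */
            }
    }

    static void
    request_stop(struct nvmm_vcpu *vcpu)
    {
            /* Race-free: no signal is needed to break nvmm_vcpu_run(). */
            nvmm_vcpu_stop(vcpu);
    }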


# 1.19 05-Sep-2020 maxv

nvmm: update copyright headers


Revision tags: phil-wifi-20200421 phil-wifi-20200411 is-mlppp-base phil-wifi-20200406 phil-wifi-20191119
# 1.18 27-Oct-2019 maxv

Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.

The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.
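
Purely as an illustration of the intended use; the prototype of the
undocumented nvmm_root_init() is an assumption here:

    #include <nvmm.h>
    #include <stdio.h>

    /*
     * Sketch: a monitoring agent that wants root_owner semantics opens
     * /dev/nvmm with write permissions via nvmm_root_init() instead of
     * the regular read-only nvmm_init().
     */
    static int
    monitor_attach(void)
    {
            if (nvmm_root_init() == -1) {
                    fprintf(stderr, "no write access to /dev/nvmm\n");
                    return -1;
            }
            /* ... inspect/manage the registered VMs ... */
            return 0;
    }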


# 1.17 27-Oct-2019 maxv

Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.


# 1.16 23-Oct-2019 maxv

Three changes in libnvmm:

- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.

- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.
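
For illustration, a per-VCPU registration under the renamed API could look
like the sketch below; the callback struct layout and the
nvmm_vcpu_configure() prototype follow the names used in this log and
should be checked against nvmm.h:

    #include <nvmm.h>

    static void io_cb(struct nvmm_io *io)    { /* emulate the port access */ }
    static void mem_cb(struct nvmm_mem *mem) { /* emulate the MMIO access */ }

    /*
     * Sketch: register the Assist callbacks on one VCPU. The new 'mach'
     * and 'vcpu' backpointers let the callbacks find their context
     * without globals.
     */
    static int
    register_assists(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu)
    {
            struct nvmm_assist_callbacks cbs = {
                    .io = io_cb,
                    .mem = mem_cb,
            };

            return nvmm_vcpu_configure(mach, vcpu, NVMM_VCPU_CONF_CALLBACKS,
                &cbs);
    }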


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. A uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The first two bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.
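
A hedged sketch of the new exit bit in use; the conf structure is assumed
to also carry a leaf selector, whose exact field name is not spelled out
above:

    #include <nvmm.h>
    #include <string.h>

    /*
     * Sketch: ask for a VMEXIT (NVMM_VCPU_EXIT_CPUID) on a hypervisor
     * CPUID leaf so the emulator can answer it itself. Only mask/exit/
     * rsvd are documented in this log entry; 'leaf' is an assumption.
     */
    static int
    trap_cpuid_leaf(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu)
    {
            struct nvmm_vcpu_conf_cpuid conf;

            memset(&conf, 0, sizeof(conf));
            conf.exit = 1;                  /* exit=1, mask=0 */
            conf.leaf = 0x40000000;         /* assumed field name */

            return nvmm_vcpu_configure(mach, vcpu, NVMM_VCPU_CONF_CPUID,
                &conf);
    }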


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2; 1.14.4;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.
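
In context, the replacement call looks roughly like this (struct name as
it was at this revision, before the later rename to nvmm_assist_callbacks):

    #include <nvmm.h>

    /* Sketch: per-machine callback registration through the conf op. */
    static int
    setup_machine_callbacks(struct nvmm_machine *mach,
        struct nvmm_callbacks *cbs)
    {
            /* NVMM_MACH_CONF_CALLBACKS sits in the <libnvmm:0-100> range
             * and is handled entirely inside libnvmm. */
            return nvmm_machine_configure(mach, NVMM_MACH_CONF_CALLBACKS,
                cbs);
    }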


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.
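
An illustrative sketch, written with the post-1.15 names
(NVMM_VCPU_EVENT_INTR, vcpu->event, nvmm_vcpu_inject()); the exact
prototype of the inject call is an assumption:

    #include <nvmm.h>
    #include <stdint.h>

    /*
     * Sketch: queue a hardware interrupt in the comm page. It is
     * committed to the kernel on the next nvmm_vcpu_run(), so no ioctl
     * is emitted per event.
     */
    static int
    queue_irq(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu,
        uint8_t vector)
    {
            vcpu->event->type = NVMM_VCPU_EVENT_INTR;
            vcpu->event->vector = vector;

            return nvmm_vcpu_inject(mach, vcpu);
    }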


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectional "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

     +---------------------------------------------+
     |                     Qemu                    |
     +---------------------------------------------+
           |                              ^
           | (0) nvmm_vcpu_getstate       | (6) Done
           |                              |
           V                              |
       +---------------------------------------+
       |                libnvmm                |
       +---------------------------------------+
          |   ^          |                 ^
(1) State |   | (2) No   | (3) Ioctl:      | (5) Ok, state
  cached? |   |          | "please cache   | fetched
          |   |          |  the state"     |
          V   |          |                 |
       +-----------+     |                 |
       | Comm Page |-----+-----------------+
       +-----------+     |
              ^          |
(4) "Alright  |          V
    babe"     |     +--------+
              +-----| Kernel |
                    +--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls made by libnvmm or the emulator will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several million
syscalls. E.g. on a 4-CPU Intel machine with 4 VCPUs, booting the Win10
install ISO goes from taking 1min35 to 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. These
memcpys will be avoided in a future version. The comm page can also be
extended to
implement future services.
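
A sketch of the resulting call pattern, using the modern calling
convention where the state sits in the comm page behind vcpu->state (the
x86 flag and register names are assumptions to be checked in nvmm.h):

    #include <nvmm.h>

    /*
     * Sketch: read the GPRs, patch RAX, and write the GPRs back. The
     * getstate may be served from the comm page without a syscall, and
     * the setstate only marks the state for commit; the kernel applies
     * it on the next nvmm_vcpu_run().
     */
    static int
    clear_rax(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu)
    {
            if (nvmm_vcpu_getstate(mach, vcpu, NVMM_X64_STATE_GPRS) == -1)
                    return -1;

            vcpu->state->gprs[NVMM_X64_GPR_RAX] = 0;

            return nvmm_vcpu_setstate(mach, vcpu, NVMM_X64_STATE_GPRS);
    }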


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.
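
An illustrative sketch combining this with the nvmm_hva_map()/nvmm_gpa_map()
scheme from revision 1.5; the exact argument order is an assumption to be
checked against nvmm.h:

    #include <nvmm.h>
    #include <sys/mman.h>
    #include <stdint.h>

    /*
     * Sketch: back one page of guest-physical memory with one page of
     * emulator memory, visible read/write (but not executable) to the
     * guest.
     */
    static int
    map_guest_page(struct nvmm_machine *mach, gpaddr_t gpa)
    {
            void *buf;

            buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (buf == MAP_FAILED)
                    return -1;
            if (nvmm_hva_map(mach, (uintptr_t)buf, 4096) == -1)
                    return -1;
            /* 'prot' takes mmap-like codes since this revision. */
            return nvmm_gpa_map(mach, (uintptr_t)buf, gpa, 4096,
                PROT_READ | PROT_WRITE);
    }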


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on memory that is itself MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.
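
For illustration, the translation helpers mentioned above combine roughly
as follows (the prototypes shown follow the present-day nvmm.h and are an
assumption for this revision):

    #include <nvmm.h>
    #include <stdint.h>

    /*
     * Sketch: translate a guest-virtual address down to a host pointer.
     * Since this revision, 'gva' no longer has to be page-aligned.
     */
    static int
    guest_ptr(struct nvmm_machine *mach, struct nvmm_vcpu *vcpu,
        gvaddr_t gva, uintptr_t *hva)
    {
            gpaddr_t gpa;
            nvmm_prot_t prot;

            if (nvmm_gva_to_gpa(mach, vcpu, gva, &gpa, &prot) == -1)
                    return -1;
            return nvmm_gpa_to_hva(mach, gpa, hva, &prot);
    }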


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), which map/unmap the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.

Also fix a collision check while here.


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.
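
For orientation, the basic lifecycle an emulator goes through with today's
libnvmm looks roughly like the sketch below (error handling is minimal and
the modern, post-1.15 calling convention is assumed):

    #include <nvmm.h>
    #include <err.h>

    int
    main(void)
    {
            struct nvmm_machine mach;
            struct nvmm_vcpu vcpu;

            if (nvmm_init() == -1)
                    err(1, "nvmm_init");
            if (nvmm_machine_create(&mach) == -1)
                    err(1, "nvmm_machine_create");
            if (nvmm_vcpu_create(&mach, 0, &vcpu) == -1)
                    err(1, "nvmm_vcpu_create");

            /* ... map guest memory, set the initial VCPU state ... */

            for (;;) {
                    if (nvmm_vcpu_run(&mach, &vcpu) == -1)
                            err(1, "nvmm_vcpu_run");
                    /* Dispatch on vcpu.exit->reason: I/O Assist, Mem
                     * Assist, shutdown, etc. */
                    break;
            }

            nvmm_vcpu_destroy(&mach, &vcpu);
            nvmm_machine_destroy(&mach);
            return 0;
    }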


# 1.19 05-Sep-2020 maxv

nvmm: update copyright headers


Revision tags: phil-wifi-20200421 phil-wifi-20200411 is-mlppp-base phil-wifi-20200406 phil-wifi-20191119
# 1.18 27-Oct-2019 maxv

Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.

The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.


# 1.17 27-Oct-2019 maxv

Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.


# 1.16 23-Oct-2019 maxv

Three changes in libnvmm:

- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.

- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2; 1.14.4;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on a memory that is itself an MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.

Also fix a collision check while here.


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.


# 1.18 27-Oct-2019 maxv

Change the way root_owner works: consider the calling process as root_owner
not if it has root privileges, but if the /dev/nvmm device was opened with
write permissions. Introduce the undocumented nvmm_root_init() function to
achieve that.

The goal is to simplify the logic and have more granularity, eg if we want
a monitoring agent to access VMs but don't want to give this agent real
root access on the system.


# 1.17 27-Oct-2019 maxv

Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.


# 1.16 23-Oct-2019 maxv

Three changes in libnvmm:

- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.

- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on a memory that is itself an MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.

Also fix a collision check while here.


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.


# 1.17 27-Oct-2019 maxv

Add the "nvmm" group, and make nvmm_init() public. Sent to tech-kern@ a few
days ago.


# 1.16 23-Oct-2019 maxv

Three changes in libnvmm:

- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.

- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on a memory that is itself an MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.

Also fix a collision check while here.


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.


# 1.16 23-Oct-2019 maxv

Three changes in libnvmm:

- Add 'mach' and 'vcpu' backpointers in the nvmm_io and nvmm_mem
structures.

- Rename 'nvmm_callbacks' to 'nvmm_assist_callbacks'.

- Rename and migrate NVMM_MACH_CONF_CALLBACKS to NVMM_VCPU_CONF_CALLBACKS,
it now becomes per-VCPU.


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectionnal "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out in a more
performant manner than ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

+---------------------------------------------+
| Qemu |
+---------------------------------------------+
| ^
| (0) nvmm_vcpu_getstate | (6) Done
| |
V |
+---------------------------------------+
| libnvmm |
+---------------------------------------+
| ^ | ^
(1) State | | (2) No | (3) Ioctl: | (5) Ok, state
cached? | | | "please cache | fetched
| | | the state" |
V | | |
+-----------+ | |
| Comm Page |------+---------------+
+-----------+ |
^ |
(4) "Alright | V
babe" | +--------+
+-----| Kernel |
+--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that the future
nvmm_vcpu_getstate() calls libnvmm or the emulator will perform will use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several millions of
syscalls. Eg on a 4CPU Intel with 4VCPUs, booting the Win10 install ISO
goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side. This will
be avoided in a future version. The comm page can also be extended to
implement future services.


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on a memory that is itself an MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, it means that the two buffers can both be in MMIO memory,
and we handle this case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), that map/unamp the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but still fix for the time being.

Also fix a collision check while here.


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete, only nvmm_assist_mem needs to be filled -- I have
a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.


# 1.15 23-Oct-2019 maxv

Miscellaneous changes in NVMM, to address several inconsistencies and
issues in the libnvmm API.

- Rename NVMM_CAPABILITY_VERSION to NVMM_KERN_VERSION, and check it in
libnvmm. Introduce NVMM_USER_VERSION, for future use.

- In libnvmm, open "/dev/nvmm" as read-only and with O_CLOEXEC. This is to
avoid sharing the VMs with the children if the process forks. In the
NVMM driver, force O_CLOEXEC on open().

- Rename the following things for consistency:
nvmm_exit* -> nvmm_vcpu_exit*
nvmm_event* -> nvmm_vcpu_event*
NVMM_EXIT_* -> NVMM_VCPU_EXIT_*
NVMM_EVENT_INTERRUPT_HW -> NVMM_VCPU_EVENT_INTR
NVMM_EVENT_EXCEPTION -> NVMM_VCPU_EVENT_EXCP
Delete NVMM_EVENT_INTERRUPT_SW, unused already.

- Slightly reorganize the MI/MD definitions, for internal clarity.

- Split NVMM_VCPU_EXIT_MSR in two: NVMM_VCPU_EXIT_{RD,WR}MSR. Also provide
separate u.rdmsr and u.wrmsr fields. This is more consistent with the
other exit reasons.

- Change the types of several variables:
event.type enum -> u_int
event.vector uint64_t -> uint8_t
exit.u.*msr.msr: uint64_t -> uint32_t
exit.u.io.type: enum -> bool
exit.u.io.seg: int -> int8_t
cap.arch.mxcsr_mask: uint64_t -> uint32_t
cap.arch.conf_cpuid_maxops: uint64_t -> uint32_t

- Delete NVMM_VCPU_EXIT_MWAIT_COND, it is AMD-only and confusing, and we
already intercept 'monitor' so it is never armed.

- Introduce vmx_exit_insn() for NVMM-Intel, similar to svm_exit_insn().
The 'npc' field wasn't getting filled properly during certain VMEXITs.

- Introduce nvmm_vcpu_configure(). Similar to nvmm_machine_configure(),
but as its name indicates, the configuration is per-VCPU and not per-VM.
Migrate and rename NVMM_MACH_CONF_X86_CPUID to NVMM_VCPU_CONF_CPUID.
This becomes per-VCPU, which makes more sense than per-VM.

- Extend the NVMM_VCPU_CONF_CPUID conf to allow triggering VMEXITs on
specific leaves. Until now we could only mask the leaves. An uint32_t
is added in the structure:
uint32_t mask:1;
uint32_t exit:1;
uint32_t rsvd:30;
The two first bits select the desired behavior on the leaf. Specifying
zero on both resets the leaf to the default behavior. The new
NVMM_VCPU_EXIT_CPUID exit reason is added.


Revision tags: netbsd-9-base phil-wifi-20190609
# 1.14 08-Jun-2019 maxv

branches: 1.14.2;
Change the NVMM API to reduce data movements. Sent to tech-kern@.


# 1.13 11-May-2019 maxv

Rework the machine configuration interface.

Provide three ranges in the conf space: <libnvmm:0-100>, <MI:100-200> and
<MD:200-...>. Remove nvmm_callbacks_register(), and replace it by the conf
op NVMM_MACH_CONF_CALLBACKS, handled by libnvmm. The callbacks are now
per-machine, and the emulators should now do:

- nvmm_callbacks_register(&cbs);
+ nvmm_machine_configure(&mach, NVMM_MACH_CONF_CALLBACKS, &cbs);

This provides more granularity, for example if the process runs two VMs
and wants different callbacks for each.


# 1.12 01-May-2019 maxv

Use the comm page to inject events, rather than ioctls, and commit them in
vcpu_run. This saves a few syscalls and copyins.

For example on Windows 10, moving the mouse from the left to right sides of
the screen generates ~500 events, which now don't result in syscalls.

The error handling is done in vcpu_run and it is less precise, but this
doesn't matter a lot, and will be solved with future NVMM error codes.


# 1.11 29-Apr-2019 maxv

Remove useless calls to nvmm_init().


# 1.10 28-Apr-2019 maxv

Modify the communication layer between the kernel NVMM driver and libnvmm:
introduce a bidirectional "comm page", a page of memory shared between
the kernel and userland, and used to transfer data in and out more
efficiently than via ioctls.

The comm page contains the VCPU state, plus three flags:

- "wanted": the states the kernel must get/set when requested via ioctls
- "cached": the states that are in the comm page
- "commit": the states the kernel must set in vcpu_run

The idea is to avoid performing expensive syscalls, by using the VCPU
state cached, either explicitly or speculatively, in the comm page. For
example, if the state is cached we do a direct 1->5 with no syscall:

     +---------------------------------------------+
     |                    Qemu                     |
     +---------------------------------------------+
           |                          ^
           | (0) nvmm_vcpu_getstate   | (6) Done
           |                          |
           V                          |
       +---------------------------------------+
       |                libnvmm                |
       +---------------------------------------+
          |   ^           |               ^
(1) State |   | (2) No    | (3) Ioctl:    | (5) Ok, state
   cached?|   |           |  "please cache|     fetched
          |   |           |   the state"  |
          V   |           |               |
     +-----------+        |               |
     | Comm Page |--------+---------------+
     +-----------+        |
             ^            |
(4) "Alright |            V
     babe"   |        +--------+
             +--------| Kernel |
                      +--------+

The main changes in behavior are:

- nvmm_vcpu_getstate(): won't emit a syscall if the state is already
cached in the comm page, will just fetch from the comm page directly
- nvmm_vcpu_setstate(): won't emit a syscall at all, will just cache
the wanted state in the comm page
- nvmm_vcpu_run(): will commit the to-be-set state in the comm page,
as previously requested by nvmm_vcpu_setstate()
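
In pseudo-C, the library-side idea behind the new nvmm_vcpu_getstate()
is roughly the following. This is only a sketch: the comm page layout,
the flag handling and the ioctl name are invented for illustration and
do not match the real headers.

#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>

/* Invented stand-in for the shared comm page layout. */
struct comm_sketch {
        uint64_t cached;        /* which state sets are valid in 'state' */
        uint8_t  state[4096];   /* cached VCPU state blob */
};

#define SKETCH_IOC_GETSTATE     _IOW('S', 0, uint64_t)  /* made-up ioctl */

static int
getstate_sketch(struct comm_sketch *comm, int devfd, uint64_t wanted,
    void *out, size_t outlen)
{
        if ((comm->cached & wanted) != wanted) {
                /* Steps (2)-(4): not cached, one ioctl asks the kernel
                 * to fill the comm page with the missing state. */
                if (ioctl(devfd, SKETCH_IOC_GETSTATE, &wanted) == -1)
                        return -1;
                comm->cached |= wanted;
        }
        /* Steps (1)->(5): serve the request from the comm page, no syscall. */
        memcpy(out, comm->state, outlen);
        return 0;
}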

In addition to this, the kernel NVMM driver is changed to speculatively
cache certain states known to be of interest, so that future
nvmm_vcpu_getstate() calls performed by libnvmm or the emulator use
the comm page rather than expensive syscalls. For example, if an I/O
VMEXIT occurs, the I/O Assist in libnvmm will want GPRS+SEGS+CRS+MSRS,
and now the kernel caches all of that in the comm page before returning
to userland.

Overall, in a normal run of Windows 10, this saves several million
syscalls. E.g. on a 4-CPU Intel machine with 4 VCPUs, booting the Win10
install ISO goes from taking 1min35 to taking 1min16.

The libnvmm API is not changed, but the ABI is. If we changed the API it
would be possible to save expensive memcpys on libnvmm's side; these
memcpys will be avoided in a future version. The comm page can also be
extended to implement future services.


# 1.9 10-Apr-2019 maxv

Add the NVMM_CTL ioctl, always privileged regardless of the permissions of
/dev/nvmm. We'll use it to provide a way for an admin to control the
registered VMs in the kernel.

Add an associated wrapper in libnvmm.


# 1.8 04-Apr-2019 maxv

Check the GPA permissions too in the Assists, because it is possible that
the guest traps on a page the virtualizer marked as read-only (even if it
appears as read-write in the HVA).


# 1.7 21-Mar-2019 maxv

Make it possible for an emulator to set the protection of the guest pages.
For some reason I had initially concluded that it wasn't doable; verily it
is, so let's do it.

The reserved 'flags' argument of nvmm_gpa_map() becomes 'prot' and takes
mmap-like protection codes.
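
A short sketch of what this enables, assuming the usual
nvmm_gpa_map(mach, hva, gpa, size, prot) parameter order and that
gpaddr_t comes from <nvmm.h>:

#include <sys/mman.h>
#include <err.h>
#include <stdint.h>
#include <stdlib.h>
#include <nvmm.h>

/*
 * Sketch: expose a ROM region to the guest read-only by passing an
 * mmap-like protection code in the new 'prot' argument.
 */
static void
map_guest_rom(struct nvmm_machine *mach, uintptr_t hva, gpaddr_t gpa,
    size_t size)
{
        if (nvmm_gpa_map(mach, hva, gpa, size, PROT_READ) == -1)
                err(EXIT_FAILURE, "nvmm_gpa_map");
}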


Revision tags: pgoyette-compat-20190127 pgoyette-compat-20190118
# 1.6 27-Dec-2018 maxv

Several improvements and fixes:

* Change the Assist API. Rather than passing callbacks in each call, the
callbacks are now registered beforehand. Then change the I/O Assist to
fetch MMIO data via the Mem callback. This allows a guest to perform an
I/O string operation on memory that is itself MMIO.

* Introduce two new functions internal to libnvmm, read_guest_memory and
write_guest_memory. They can handle mapped memory, MMIO memory and
cross-page transactions.

* Allow nvmm_gva_to_gpa and nvmm_gpa_to_hva to take non-page-aligned
addresses. This simplifies a lot of things.

* Support the MOVS instruction, and add a test for it. This instruction
is special, in that it takes two implicit memory operands. In
particular, both buffers can be in MMIO memory, and we handle this
case.

* Fix gross copy-pasto in nvmm_hva_unmap. Also fix a few things here and
there.


Revision tags: pgoyette-compat-1226
# 1.5 15-Dec-2018 maxv

Invert the mapping logic.

Until now, the "owner" of the memory was the guest, and by calling
nvmm_gpa_map(), the virtualizer was creating a view towards the guest
memory.

Qemu expects the contrary: it wants the owner to be the virtualizer, and
nvmm_gpa_map should just create a view from the guest towards the
virtualizer's address space. Under this scheme, it is legal to have two
GPAs that point to the same HVA.

Introduce nvmm_hva_map() and nvmm_hva_unmap(), which map/unmap the HVA into
a dedicated UOBJ. Change nvmm_gpa_map() and nvmm_gpa_unmap() to just
perform an enter into the desired UOBJ.

With this change in place, all the mapping-related problems in Qemu+NVMM
are fixed.
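
Under the new scheme, the flow looks roughly like this. It is a sketch
only: the prototypes are assumed, and at this revision the last
nvmm_gpa_map() argument is still the reserved 'flags' (it becomes 'prot'
later, see revision 1.7 above).

#include <sys/mman.h>
#include <err.h>
#include <stdint.h>
#include <stdlib.h>
#include <nvmm.h>

/*
 * Sketch: the virtualizer owns the memory, registers it once with
 * nvmm_hva_map(), and may then create several guest views of it with
 * nvmm_gpa_map() -- here two GPAs aliasing the same HVA.
 */
static void
alias_guest_ram(struct nvmm_machine *mach, size_t size)
{
        void *hva;

        hva = mmap(NULL, size, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE, -1, 0);
        if (hva == MAP_FAILED)
                err(EXIT_FAILURE, "mmap");

        if (nvmm_hva_map(mach, (uintptr_t)hva, size) == -1)
                err(EXIT_FAILURE, "nvmm_hva_map");

        /* Two guest-physical windows onto the same host memory. */
        if (nvmm_gpa_map(mach, (uintptr_t)hva, 0x100000, size, 0) == -1 ||
            nvmm_gpa_map(mach, (uintptr_t)hva, 0xc0000000, size, 0) == -1)
                err(EXIT_FAILURE, "nvmm_gpa_map");
}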


# 1.4 12-Dec-2018 maxv

Change the map/unmap functions, again.


# 1.3 29-Nov-2018 maxv

Rewrite the gpa map/unmap functions. Dig holes in the mapped areas when
there is an overlap. Close to what Qemu expects.


Revision tags: pgoyette-compat-1126
# 1.2 19-Nov-2018 maxv

branches: 1.2.2;
Fix error handling of realloc, and use memmove because the areas overlap;
noted by agc@. These _nvmm_area_add/delete functions don't make a lot of
sense right now and will likely be rewritten to match the behavior
expected by Qemu; but fix them for the time being anyway.

Also fix a collision check while here.
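
The generic C patterns behind both fixes look like this (an
illustration only, not the actual _nvmm_area_add/delete code):

#include <stdlib.h>
#include <string.h>

/* Grow an array: check realloc() through a temporary pointer so the
 * old block is neither leaked nor lost on failure. */
static int
area_grow(void **arrayp, size_t *nmemb, size_t elemsize)
{
        void *tmp;

        tmp = realloc(*arrayp, (*nmemb + 1) * elemsize);
        if (tmp == NULL)
                return -1;
        *arrayp = tmp;
        (*nmemb)++;
        return 0;
}

/* Delete an element: the source and destination regions overlap,
 * so memmove() is required rather than memcpy(). */
static void
area_delete(void *array, size_t *nmemb, size_t elemsize, size_t idx)
{
        char *base = array;

        memmove(base + idx * elemsize, base + (idx + 1) * elemsize,
            (*nmemb - idx - 1) * elemsize);
        (*nmemb)--;
}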


# 1.1 10-Nov-2018 maxv

Add libnvmm, NetBSD's new virtualization API. It provides a way for VMM
software to effortlessly create and manage virtual machines via NVMM.

It is mostly complete; only nvmm_assist_mem needs to be filled in -- I
have a draft for that, but it needs some more care. This Mem Assist should
not be needed when emulating a system in x2apic mode, so theoretically
the current form of libnvmm is sufficient to emulate a whole class of
systems.

Generally speaking, there are so many modes in x86 that it is difficult
to handle each corner case without introducing a ton of checks that just
slow down the common-case execution. Currently we check a limited number
of things; we may add more checks in the future if they turn out to be
needed, but that's rather low priority.

Libnvmm is compiled and installed only on amd64. A man page (reviewed by
wiz@) is provided.

