History log of /openbsd-current/sys/arch/amd64/amd64/locore.S
Revision	Date	Author	Comments
# 1.147	17-Mar-2024	guenther	Use VERW to mitigate the RFDS (Register File Data Sampling) vulnerability present in Intel Atom CPUs, reordering some ASM in return-to-userspace and start/resume-vmx-guest to reduce the number of kernel values still live in registers when VERW is used. This mitigation requires updated firmware which has affected CPUs report RFDS_CLEAR in dmesg. Firmware packaging by jsg@ and sthen@ Logic for interpreting intel's flags by jsg@ after lots of discussion between him, deraadt@, and I ok deraadt@
Revision tags: OPENBSD_7_5_BASE
# 1.146	25-Feb-2024	guenther	We don't do compat32 so MSR_CSTAR shouldn't be set up: delete the Xsyscall32 stub and UCODE32 selector, set MSR_CSTAR to zero at CPU startup, and rezero on ACPI resume and VM exit. requested a while ago by deraadt@ AMD VM testing chris@ testing and ok krw@
# 1.145	12-Feb-2024	guenther	Retpolines are an anti-pattern for IBT, so we need to shift protecting userspace from cross-process BTI to the kernel. Have each CPU track the last pmap run on in userspace and the last vmm VCPU in guest-mode and use the IBPB msr to flush predictors right before running in userspace on a different pmap or entering guest-mode on a different VCPU. Codepatch-nop the userspace bits and conditionalize the vmm bits to keep working if IBPB isn't supported. ok deraadt@ kettenis@
# 1.144	12-Dec-2023	deraadt	remove support for syscall(2) -- the "indirection system call" because it is a dangerous alternative entry point for all system calls, and thus incompatible with the precision system call entry point scheme we are heading towards. This has been a 3-year mission: First perl needed a code-generated wrapper to fake syscall(2) as a giant switch table, then all the ports were cleaned with relatively minor fixes, except for "go". "go" required two fixes -- 1) a framework issue with old library versions, and 2) like perl, a fake syscall(2) wrapper to handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over the place in the "go" ecosystem because the "go developers" are plan9-loving unix-hating folk who tried to build an ecosystem without allowing "ioctl". ok kettenis, jsing, afresh1, sthen
# 1.143	12-Dec-2023	deraadt	The sigtramp was calling sigreturn(2), and upon failure exit(2), which doesn't make sense anymore. It is better to just issue an illegal instruction. ok kettenis, with some misgivings about inconsistent approaches between architectures. In the future we could change sigreturn(2) to never return an exit code, but always just terminate the process. We stopped this system call from being callable ages ago with msyscall(2), and there is no stub for it in libc. Maybe that's the next step to take?
# 1.142	10-Dec-2023	deraadt	Add a new label "sigcodecall" inside every sigtramp definition, directly in front of the syscall instruction. This is used to calculate the start of the syscall for SYS_sigreturn and pinned system calls. ok kettenis
# 1.141	24-Oct-2023	claudio	Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside. Clean up these code paths and make them machine independent. - when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code, change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called. - cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU. - sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop. Since the last two cases require a context switch to the idle proc, factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
Revision tags: OPENBSD_7_4_BASE
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled in the kernel, the hardware should block the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but mark it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with an endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.143	12-Dec-2023	deraadt	The sigtramp was calling sigreturn(2), and upon failure exit(2), which doesn't make sense anymore. It is better to just issue an illegal instruction. ok kettenis, with some misgivings about inconsistent approaches between architectures. In the future we could change sigreturn(2) to never return an exit code, but always just terminate the process. We stopped this system call from being callable ages ago with msyscall(2), and there is no stub for it in libc.. maybe that's the next step to take?
# 1.142	10-Dec-2023	deraadt	Add a new label "sigcodecall" inside every sigtramp definition, directly in front of the syscall instruction. This is used to calculate the start of the syscall for SYS_sigreturn and pinned system calls. ok kettenis
# 1.141	24-Oct-2023	claudio	Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside. Cleanup these code paths and make the machine independent. - when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called. - cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU. - sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop. Since the last two cases require a context switch to the idle proc factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
Revision tags: OPENBSD_7_4_BASE
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled in the kernel, the hardware should block the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but mark it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown is in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove a unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: in locore.S, %r9 is not saved over function calls and, moreover, we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
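The corrected "am I loading the same cr3?" test from 1.125 can be illustrated with a small C sketch. This is not the kernel's assembly: `sketch_same_cr3` and `SKETCH_CR3_REUSE_PCID` are hypothetical names, and the sketch assumes the reuse flag is bit 63 of the value written to %cr3, the architectural "do not invalidate" hint when PCIDs are enabled.

```c
#include <stdint.h>

/* Assumption: the reuse/no-flush hint is bit 63 of the value loaded into %cr3. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/*
 * Return nonzero when two %cr3 values name the same page table and PCID,
 * ignoring the reuse hint, which says nothing about which table is loaded.
 */
static int
sketch_same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

With a comparison like this, a switch between two threads sharing one set of page tables is recognized as "no change" even if only one of the two values carries the hint bit.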
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
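The flag tracking described in 1.98 can be sketched as a tiny state machine in C. This is only an illustration of the two rules in the message, not the kernel's implementation: the struct, its fields, and both function names are hypothetical, and plain integers stand in for the CPU's extended state and the PCB save area.

```c
/* Hypothetical model: integers stand in for real xstate contents. */
struct sketch_fpu {
	int user_loaded;	/* is curproc's xstate currently in the CPU? */
	int cpu_regs;		/* stand-in for the CPU's FPU/xstate registers */
	int pcb_saved;		/* stand-in for the save area in the PCB */
};

/* Context switch, sendsig(), vmm, kernel crypto: save user state if loaded. */
static void
sketch_fpu_kernel_claim(struct sketch_fpu *f)
{
	if (f->user_loaded) {
		f->pcb_saved = f->cpu_regs;	/* save the thread's state to the PCB */
		f->user_loaded = 0;		/* clear the flag */
		f->cpu_regs = 0;		/* load the blank state */
	}
}

/* Return to userspace: restore the thread's state only if it is not loaded. */
static void
sketch_fpu_return_to_user(struct sketch_fpu *f)
{
	if (!f->user_loaded) {
		f->user_loaded = 1;
		f->cpu_regs = f->pcb_saved;
	}
}
```

The sketch leaves out the fault handling the message mentions: in the real kernel the restore step can fault on an altered state and must be recoverable.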
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove a unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
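The cookie scheme in 1.78 can be sketched in C roughly as follows. This is a minimal illustration, not the kernel's code: the struct and the names `sketch_sigcontext`, `sketch_make_cookie`, and `sketch_check_cookie` are hypothetical, and the separate check that sigreturn(2) was entered from the exact sigtramp PC is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the signal context; only the cookie matters here. */
struct sketch_sigcontext {
	uint64_t sc_cookie;
};

/* sendsig() side: tie the cookie to this process and this sigcontext address. */
static void
sketch_make_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: return 1 if the cookie matches this process and this
 * address, clearing it so the same sigcontext cannot be replayed; else 0.
 */
static int
sketch_check_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return 0;
	sc->sc_cookie = 0;
	return 1;
}
```

Because the address of the sigcontext is mixed into the cookie, a context copied to a different location fails the check, and clearing the cookie after one successful use is what blocks reuse.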
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is joint work with jsg@. The fixed-function performance counter read code comes from an earlier diff of his. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
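The "skip the setting if the CPU has the right value already" idea from 1.45 can be sketched in C as below. This is a userland model under stated assumptions, not the kernel's return-to-userspace path: `sketch_cpu`, `cur_fsbase`, `writes`, and `sketch_load_fsbase` are hypothetical names, and a counter stands in for the actual register write (wrmsr, or reloading %fs when the value is zero).

```c
#include <stdint.h>

/* Hypothetical per-CPU record: the cached FS.base and a count of real writes. */
struct sketch_cpu {
	uint64_t cur_fsbase;	/* models ci_cur_fsbase */
	unsigned writes;	/* how many times the register was written */
};

/* On return to userspace: write FS.base only if the CPU's value differs. */
static void
sketch_load_fsbase(struct sketch_cpu *ci, uint64_t want)
{
	if (ci->cur_fsbase == want)
		return;		/* CPU already holds the thread's value */
	ci->cur_fsbase = want;
	ci->writes++;		/* stands in for the register write */
}
```

Calling `sketch_load_fsbase` twice in a row with the same value performs only one write, which is the saving the commit message describes for repeated returns to the same thread.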
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok art@
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.144	12-Dec-2023	deraadt	remove support for syscall(2) -- the "indirection system call" because it is a dangerous alternative entry point for all system calls, and thus incompatible with the precision system call entry point scheme we are heading towards. This has been a 3-year mission: First perl needed a code-generated wrapper to fake syscall(2) as a giant switch table, then all the ports were cleaned with relatively minor fixes, except for "go". "go" required two fixes -- 1) a framework issue with old library versions, and 2) like perl, a fake syscall(2) wrapper to handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over the place in the "go" ecosystem because the "go developers" are plan9-loving unix-hating folk who tried to build an ecosystem without allowing "ioctl". ok kettenis, jsing, afresh1, sthen
# 1.143	12-Dec-2023	deraadt	The sigtramp was calling sigreturn(2), and upon failure exit(2), which doesn't make sense anymore. It is better to just issue an illegal instruction. ok kettenis, with some misgivings about inconsistant approaches between architectures. In the future we could change sigreturn(2) to never return an exit code, but always just terminate the process. We stopped this system call from being callable ages ago with msyscall(2), and there is no stub for it in libc.. maybe that's the next step to take?
# 1.142	10-Dec-2023	deraadt	Add a new label "sigcodecall" inside every sigtramp definition, directly in front of the syscall instruction. This is used to calculate the start of the syscall for SYS_sigreturn and pinned system calls. ok kettenis
# 1.141	24-Oct-2023	claudio	Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside. Cleanup these code paths and make the machine independent. - when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called. - cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU. - sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop. Since the last two cases require a context switch to the idle proc factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
Revision tags: OPENBSD_7_4_BASE
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled the kernel, the hardware should the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but marks it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single a process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior than that is used, otherwise use the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY is a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assember nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.142	10-Dec-2023	deraadt	Add a new label "sigcodecall" inside every sigtramp definition, directly in front of the syscall instruction. This is used to calculate the start of the syscall for SYS_sigreturn and pinned system calls. ok kettenis
# 1.141	24-Oct-2023	claudio	Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside. Cleanup these code paths and make them machine independent. - when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called. - cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU. - sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop. Since the last two cases require a context switch to the idle proc factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
Revision tags: OPENBSD_7_4_BASE
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled in the kernel, the hardware should block the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but mark it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with an endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single a process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.141	24-Oct-2023	claudio	Normally context switches happen in mi_switch() but there are 3 cases where a switch happens outside. Cleanup these code paths and make the machine independent. - when a process forks (fork, tfork, kthread), the new proc needs to somehow be scheduled for the first time. This is done by proc_trampoline. Since proc_trampoline is machine dependent assembler code change the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make sure it is now always called. - cpu_hatch: when booting APs the code needs to jump to the first proc running on that CPU. This should be the idle thread for that CPU. - sched_exit: when a proc exits it needs to switch away from itself and then instruct the reaper to clean up the rest. This is done by switching to the idle loop. Since the last two cases require a context switch to the idle proc factor out the common code to sched_toidle() and use it in those places. Tested by many on all archs. OK miod@ mpi@ cheloha@
Revision tags: OPENBSD_7_4_BASE
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled in the kernel, the hardware should block the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but mark it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove a unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.140	31-Jul-2023	guenther	On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation") or IBT enabled in the kernel, the hardware should block the attacks which retpolines were created to prevent. In those cases, retpolines should be a net negative for security as they are an indirect branch gadget. They're also slower. * use -mretpoline-external-thunk to give us control of the code used for indirect branches * default to using a retpoline as before, but mark it and the other ASM kernel retpolines for code patching * if the CPU has eIBRS, then enable it * if the CPU has eIBRS or IBT, then codepatch the three different retpolines to just indirect jumps make clean && make config required after this ok kettenis@
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock, which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.139	28-Jul-2023	guenther	Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk of code to use in codepatching. Use that for all the existing codepatching snippets. Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also provides a short variable holding the length of the codepatch snippet. Use that for some snippets that will be used for retpoline replacement. ok kettenis@ deraadt@
# 1.138	27-Jul-2023	guenther	Follow the lead of mips64 and make cpu_idle_cycle() just call the indirect pointer itself and provide an initializer for that going to the default "just enable interrupts and halt" path. ok kettenis@
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with an endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents the ret instruction from being at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.137	25-Jul-2023	guenther	cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define away the calls ok deraadt@ mpi@ miod@
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok art@
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.136	10-Jul-2023	guenther	Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS to save/restore the state and enabling it at exec-time (and for signal handling) if the PS_NOBTCFI flag isn't set. Note: this changes the format of the sc_fpstate data in the signal context to possibly be in compressed format: starting now we just guarantee that that state is in a format understood by the XRSTOR instruction of the system that is being executed on. At this time, passing sigreturn a corrupt sc_fpstate now results in the process exiting with no attempt to fix it up or send a T_PROTFLT trap. That may change. prodding by deraadt@ issues with my original signal handling design identified by kettenis@ lots of base and ports preparation for this by deraadt@ and the libressl and ports teams ok deraadt@ kettenis@
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with an endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown mitigation is in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.135	05-Jul-2023	anton	The hypercall page populated with instructions by the hypervisor is not IBT compatible due to lack of endbr64. Replace the indirect call with a new hv_hypercall_trampoline() routine which jumps to the hypercall page without any indirection. Allows me to boot OpenBSD using Hyper-V on Windows 11 again. ok guenther@
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with a endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.134	17-Apr-2023	deraadt	For future userland IBT, the sigcode needs to start with an endbr64. This is simpler than clearing the cet_u bits in the kernel. ok guenther, kettenis
# 1.133	17-Apr-2023	deraadt	IDTVEC_NOALIGN() was the incorrect way to create a label in two places, use GENTRY() instead. Also add two endbr64 which cannot be supplied by macros ok guenther
Revision tags: OPENBSD_7_3_BASE
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.132	20-Jan-2023	deraadt	On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which contain PG_XO, which is PKU key1. On every exit from kernel to userland, force the PKU register to inhibit data read against key1 memory. On (some) traps into the kernel if the PKU register is changed, abort the process (processes have no reason to change the PKU register). This provides us with viable xonly functionality on most modern intel & AMD cpus. I started with a xsave-based diff from dv@, but discovered the fpu save/restore logic wasn't a good fit and went to direct register management. Disabled on HV (vm) systems until we know they handle PKU correctly. ok kettenis, dv, guenther, etc
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to be a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.131	01-Dec-2022	guenther	_C_LABEL() is no longer useful in the "everything is ELF" world. Start eliminating it. ok mpi@ mlarkin@ krw@
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperative work with jsg@. The fixed-function performance counter read code comes from a former diff of his. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: in locore.S, %r9 is not saved over function calls and, moreover, we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.130	29-Nov-2022	guenther	Move the generic variable definitions from the ASM at the top of locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and biosextmem as unused/ignored. ok mpi@ krw@ mlarkin@
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents the ret instruction from landing at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 November 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
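A minimal sketch of the cookie scheme described in rev 1.78, assuming a simplified stand-in structure. The names `sigcontext_sketch`, `cookie_store`, and `cookie_verify_and_clear` are hypothetical and are not the kernel's own; the check of the syscall entry PC against the sigtramp is not modelled here.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the real sigcontext. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;
};

/* sendsig() side: cookie = per-process secret XOR address of this sigcontext. */
static void
cookie_store(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/*
 * sigreturn() side: accept only a cookie that matches this process and
 * this exact address, then clear it so the same sigcontext cannot be
 * replayed. Returns 1 on success, 0 on mismatch.
 */
static int
cookie_verify_and_clear(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return 0;
	sc->sc_cookie = 0;
	return 1;
}
```

Because the address is mixed into the cookie, a sigcontext copied to a different location fails the check, and clearing the cookie makes a second sigreturn on the same frame fail as well.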
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
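The "size - 1" in rev 1.31 follows from the x86 convention that a descriptor-table limit is the offset of the last valid byte, not a byte count. A small sketch of the arithmetic, assuming 8-byte slots; `gdt_slot_sketch` and `gdt_limit` are hypothetical names, not the kernel's:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte descriptor slot. */
struct gdt_slot_sketch {
	uint64_t raw;
};

/*
 * Limit value for the lgdt operand: the offset of the last valid byte
 * of the table, i.e. table size in bytes minus one.
 * nslots must be between 1 and 8192 for the result to fit in 16 bits.
 */
static uint16_t
gdt_limit(size_t nslots)
{
	return (uint16_t)(nslots * sizeof(struct gdt_slot_sketch) - 1);
}
```

With three slots the table is 24 bytes and the limit is 23; loading 24 instead would make the CPU treat one byte past the table as part of it.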
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.129	04-Nov-2022	kettenis	EFI firmware has bugs which may mean that calling EFI runtime services will fault because it does memory accesses outside of the regions it told us to map. Try to mitigate this by installing a fault handler (using the pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a page fault while executing an EFI runtime services call. Since some firmware bugs result in us executing code that isn't mapped, make kpageflttrap() handle execution faults as well as data faults. ok guenther@
Revision tags: OPENBSD_7_2_BASE
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
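A sketch of the "am I loading the same cr3?" comparison that rev 1.125 describes, assuming the reuse flag is bit 63 of the value written to %cr3 (the architectural "do not flush" bit when PCIDs are enabled). The constant and function names below are hypothetical stand-ins, and the real test lives in assembly in cpu_switchto():

```c
#include <stdint.h>

/* Assumed stand-in for CR3_REUSE_PCID: bit 63 of the value moved to %cr3. */
#define CR3_REUSE_PCID_SKETCH	(1ULL << 63)

/*
 * Return 1 if both values name the same page table and PCID, ignoring
 * the reuse hint, which is not part of the address space identity.
 */
static int
same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~CR3_REUSE_PCID_SKETCH) == 0;
}
```

Comparing the raw values would report a mismatch whenever only the hint bit differed, forcing a %cr3 reload that the masked comparison avoids.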
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.128	07-Aug-2022	guenther	Start to add annotations to the cpu_info members, doing I/a/o for immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and CPUF_USERXSTATE, which really are private to the CPU, into a new ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags alterations via atomic_{set,clear}bits_int(), so its annotation isn't a lie. Delete ci_info member as unused all the way from rev 1.1 ok jsg@ mlarkin@
Revision tags: OPENBSD_7_1_BASE
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok art@
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.127	31-Dec-2021	jsg	specifed -> specified
Revision tags: OPENBSD_7_0_BASE
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents that the ret instruction is at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: in locore.S, %r9 is not saved over function calls, and moreover we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.126	04-Sep-2021	bluhm	To mitigate against spectre attacks, AMD processors without the IBRS feature need an lfence instruction after every near ret. Place them after all functions in the kernel which are implemented in assembler. Change the retguard macro so that the end of the lfence instruction is 16-byte aligned now. This prevents the ret instruction from being at the end of a 32-byte boundary. The latter would cause a performance impact on certain Intel processors which have a microcode update to mitigate the jump conditional code erratum. See software techniques for managing speculation on AMD processors revision 9.17.20 mitigation G-5. See Intel mitigations for jump conditional code erratum revision 1.0 november 2019 2.4 software guidance and optimization methods. OK deraadt@ mortimer@
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown is in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
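The cookie scheme described in rev 1.78 can be sketched roughly as below. This is only an illustration under stated assumptions: `struct sigcontext_sketch`, `cookie_store()` and `cookie_check_and_clear()` are hypothetical names, not the kernel's, and the separate check that sigreturn(2) was entered from the exact sigtramp PC is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real sigcontext; only the cookie is modelled. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* Signal delivery: tie the saved context to this process and this address. */
static void
cookie_store(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * Signal return: accept the context only if the cookie matches, then clear
 * it so the same sigcontext cannot be replayed.
 */
static bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	bool ok = sc->sc_cookie == (proc_secret ^ (uint64_t)(uintptr_t)sc);

	sc->sc_cookie = 0;
	return ok;
}
```

Because the address of the sigcontext is mixed into the cookie, a context copied to a different address fails the check, and clearing the cookie makes a second return through the same context fail as well.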
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
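The return-to-userspace decision described in rev 1.45 might look roughly like the following sketch. The struct and counters are hypothetical stand-ins: the real code is assembly operating on ci_cur_fsbase and pcb_fsbase, and the counters here only record which path would have been taken.

```c
#include <stdint.h>

/* Hypothetical per-CPU state; the counters stand in for the real instructions. */
struct cpu_sketch {
	uint64_t cur_fsbase;       /* value the CPU currently has in FS.base */
	unsigned wrmsr_count;      /* times a wrmsr would have been issued */
	unsigned seg_reload_count; /* times %fs would have been reloaded */
};

/* Called on return to userspace with the thread's desired FS.base. */
static void
return_to_user(struct cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;                 /* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->seg_reload_count++; /* setting %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;      /* non-zero base needs the MSR write */
	ci->cur_fsbase = pcb_fsbase;
}
```

The point of the design is that the expensive MSR write happens only when the value actually changes to something non-zero.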
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.125	18-Jun-2021	guenther	The pmap needs to know which CPUs to send IPIs when TLB entries need to be invalidated. Instead of keeping a bitset of CPUs in each pmap, have each cpu_info track which pmap it has loaded: replace pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic operations (and cache thrashing) and simplifies cpu_switchto() Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?" test: ignore the CR3_REUSE_PCID bit when checking that. This makes switching between kernel threads slightly less costly. over a week in snaps with no complaints looks ok to mlarkin@ kettenis@ mpi@
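The corrected "am I loading the same cr3?" test from rev 1.125 amounts to comparing the two values with the reuse bit masked out. A minimal C sketch follows; the real test is in assembly, and the constant name and its position at bit 63 are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for this sketch: the "reuse PCID / don't flush" flag is bit 63. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/* True when both values name the same page tables and PCID, ignoring the flag. */
static bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

Without the mask, two values that differ only in the flag would look like different page tables and force an unnecessary reload.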
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single a process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	branches: 1.122.2; Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	branches: 1.120.4; Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown is in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove a unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
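The cookie scheme in rev 1.78 can be sketched in C as follows. This is a minimal illustration, not the kernel's code: the struct and function names and the `ps_sigcookie` and `sc_cookie` fields are hypothetical stand-ins, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real kernel structures differ. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* written by the kernel at signal delivery */
};

struct process_sketch {
	uintptr_t ps_sigcookie;	/* per-process secret value (assumed name) */
};

/* Signal delivery: bind the context to this process and this address. */
static void
sendsig_sketch(const struct process_sketch *pr, struct sigcontext_sketch *scp)
{
	scp->sc_cookie = pr->ps_sigcookie ^ (uintptr_t)scp;
}

/* sigreturn: accept the context once, then clear the cookie to block reuse. */
static bool
sigreturn_sketch(const struct process_sketch *pr, struct sigcontext_sketch *scp)
{
	if (scp->sc_cookie != (pr->ps_sigcookie ^ (uintptr_t)scp))
		return false;		/* forged, moved, or already used */
	scp->sc_cookie = 0;
	return true;
}
```

Mixing the address of the sigcontext into the cookie means a context copied to another location fails the check, and clearing the cookie after a successful return means the same context cannot be replayed.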
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
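The skip-if-current logic in rev 1.45 can be sketched in C as below. `pcb_fsbase` and `ci_cur_fsbase` are the names given in the commit message; the struct names, the function name, and the two counters (which stand in for the actual wrmsr and %fs reload) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct pcb_sketch {
	uint64_t pcb_fsbase;		/* FS.base the thread wants */
};

struct cpu_info_sketch {
	uint64_t ci_cur_fsbase;		/* FS.base the CPU currently holds */
	int msr_writes;			/* stands in for wrmsr */
	int seg_reloads;		/* stands in for reloading %fs */
};

/* Return-to-userspace step: only touch FS.base when it actually differs. */
static void
restore_fsbase_sketch(struct cpu_info_sketch *ci, const struct pcb_sketch *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		ci->seg_reloads++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->msr_writes++;	/* nonzero base needs the MSR write */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

The zero case is split out because, as the commit message notes, non-threaded processes without TLS leave FS.base at zero and reloading %fs costs fewer cycles than wrmsr.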
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.124	01-Jun-2021	guenther	Don't clear the cpu's bit in the old pmap's pm_cpus until we're off the old one and set it in the new pmap's pm_cpus before loading %cr3 with the new value. In particular, do neither if %cr3 isn't changing. This eliminates a window where, when switching between threads in a single process, the pmap wouldn't have this cpu's bit set even though we didn't change %cr3. With more of uvm unlocked, it was possible for another cpu to update the page tables but not see a need to send an IPI to this cpu, leading to crashes when TLB entries that should have been invalidated were used. malloc_duel testing by abluhm@ ok abluhm@ kettenis@ mlarkin@
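The ordering rule in rev 1.124 can be sketched in C as below. This is an illustration under assumptions: `pm_cpus` is the field named in the commit message, but `pm_cr3`, `struct switch_cpu_sketch`, and the function name are hypothetical, and plain bit operations stand in for the atomic instructions the real assembly would need.

```c
#include <assert.h>
#include <stdint.h>

struct pmap_sketch {
	uint64_t pm_cpus;	/* bitmask of CPUs that may have this pmap loaded */
	uint64_t pm_cr3;	/* page table root (hypothetical field name) */
};

struct switch_cpu_sketch {
	uint64_t cr3;		/* stand-in for the %cr3 register */
	unsigned int cpu_number;
};

static void
switch_pmap_sketch(struct switch_cpu_sketch *ci, struct pmap_sketch *oldpm,
    struct pmap_sketch *newpm)
{
	uint64_t bit = 1ULL << ci->cpu_number;

	if (ci->cr3 == newpm->pm_cr3)
		return;			/* %cr3 isn't changing: do neither */
	newpm->pm_cpus |= bit;		/* mark new pmap before loading %cr3 */
	ci->cr3 = newpm->pm_cr3;	/* now running on the new pmap */
	oldpm->pm_cpus &= ~bit;		/* clear only once off the old pmap */
}
```

With this order the CPU's bit is set in at least one pmap at every point, so another CPU that modifies the page tables always sees that a TLB shootdown IPI is needed.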
# 1.123	25-May-2021	guenther	clang's assembler now supports 64-suffixed versions of the fxsave/xsave/fxrstor/xrstor family of instructions. Use them directly instead of inserting the 0x48 prefix manually. ok kettenis@ deraadt@
Revision tags: OPENBSD_6_9_BASE
# 1.122	03-Nov-2020	guenther	Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter). Delete a couple debugging hooks mlarkin@ and I used during Meltdown work. tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
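The cookie scheme described in rev 1.78 can be illustrated with a minimal sketch. This is not the kernel's code: the struct, field, and function names below are hypothetical stand-ins, and the check that the syscall came from the exact sigtramp PC is not modelled. Only the "secret XOR address, verify, then clear" logic is shown.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real sigcontext; only the cookie is modelled. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* (per-process secret) ^ (address of this frame) */
};

/* sendsig() side: stamp the frame before it is handed to userspace. */
static void
stamp_cookie(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/*
 * sigreturn() side: accept the frame only if the cookie matches this
 * process and this exact address, then clear it so the same frame
 * cannot be replayed.  Returns 0 on success, -1 on rejection.
 */
static int
check_and_clear_cookie(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address of the frame is mixed into the cookie, a frame copied to another location fails the check, and clearing the cookie on success makes a second sigreturn on the same frame fail as well.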
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of his. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
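The FS.base caching described in rev 1.45 can be sketched in C. The field names pcb_fsbase and ci_cur_fsbase come from the commit message; everything else here (the struct, the counters, the function name) is a hypothetical illustration, and the privileged instructions are replaced by counters so the logic can be exercised in userspace.

```c
#include <stdint.h>

/* Hypothetical per-CPU state; counters stand in for privileged instructions. */
struct cpu_sketch {
	uint64_t ci_cur_fsbase;	/* FS.base value currently loaded on this CPU */
	unsigned msr_writes;	/* simulated wrmsr to the FS.base MSR */
	unsigned seg_loads;	/* simulated reloads of the %fs selector */
};

/* Called on return to userspace with the thread's wanted FS.base. */
static void
restore_fsbase(struct cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->ci_cur_fsbase == pcb_fsbase)
		return;			/* CPU already holds the right value */
	if (pcb_fsbase == 0)
		ci->seg_loads++;	/* loading %fs zeros FS.base cheaply */
	else
		ci->msr_writes++;	/* otherwise pay for the MSR write */
	ci->ci_cur_fsbase = pcb_fsbase;
}
```

The sketch shows both savings mentioned in the commit: no work at all when the value is unchanged, and the cheaper selector reload when the wanted base is zero.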
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
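The off-by-one fixed in rev 1.31 follows from how x86 defines the limit field of a descriptor-table register: it is the offset of the last valid byte, not the table size. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit to load alongside the table base.
 * table_bytes is the total size of the descriptor table in bytes.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	/* The limit field is 16 bits wide, so the table is at most 64 KiB. */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

For example, a table of three 8-byte descriptors occupies 24 bytes and gets a limit of 23; using 24 would declare one byte past the end of the table as valid.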
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.122	03-Nov-2020	guenther	Give sizes to more of the functions in locore.S ok mpi@
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.121	02-Nov-2020	guenther	Restore abstraction of register saving into macros in frameasm.h The Meltdown mitigation work ran right across the previous abstractions; draw slightly different lines and use separate macros for interrupts vs traps vs syscall. The generated ASM for traps and general interrupts is completely unchanged; the ASM for the four directly routed interrupts is brought into line with the general interrupts; the ASM for syscalls is changed to delay reenabling interrupts until after all registers are saved and cleared. ok mpi@
Revision tags: OPENBSD_6_8_BASE
# 1.120	17-May-2020	deraadt	Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used, otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There are also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.120	17-May-2020	deraadt	Put setjmp+longjmp inside #ifdef DDB the only kernel-side user. This shrinks the ramdisks a tiny bit.
Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: reseting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.119	07-Aug-2019	guenther	Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip or mis-take swapgs in interrupt path and in trap/fault/exception path. The latter is improved to have no conditionals around this when Meltdown mitigation is in effect. Codepatch out the fences based on the description of CPU bugs in the (well written) Linux commit message. feedback from kettenis@ ok deraadt@
# 1.118	17-May-2019	guenther	Mitigate Intel's Microarchitectural Data Sampling vulnerability. If the CPU has the new VERW behavior then that is used; otherwise the proper sequence from Intel's "Deep Dive" doc is used in the return-to-userspace and enter-VMM-guest paths. The enter-C3-idle path is not mitigated because it's only a problem when SMT/HT is enabled: mitigating everything when that's enabled would be a _huge_ set of changes that we see no point in doing. Update vmm(4) to pass through the MSR bits so that guests can apply the optimal mitigation. VMM help and specific feedback from mlarkin@ vendor-portability help from jsg@ and kettenis@ ok kettenis@ mlarkin@ deraadt@ jsg@
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	branches: 1.116.2; Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	branches: 1.111.2; In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.117	12-May-2019	guenther	Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to cpu_idle_cycle() ok mpi@ kettenis@
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
Revision tags: OPENBSD_6_5_BASE
# 1.116	02-Apr-2019	mortimer	Add variable length trap padding between the retguard epilogue and the following return. This change adds a constraint that the name passed to the RETGUARD_* macros must correspond to the name in the corresponding ENTRY which starts the function (or a function which appears beforehand in the same file). Since we use the distance from the ENTRY definition to calculate how much padding to insert, the ENTRY symbol must be in scope at assembly time. This is almost always the case already, since it is the natural way to name the retguard symbols so they remain unique. ok deraadt@
# 1.115	01-Apr-2019	mortimer	Add retguard macros to kernel setjmp / longjmp. ok deraadt@ kettenis@
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until ht is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.114	18-Feb-2019	yasuoka	Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also fixes kernel core dump to be readable by savecore. From fukaumi at soum.co.jp ok mlarkin
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY is a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.113	24-Jan-2019	deraadt	gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so move it to right place.
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Set up a hypercall page in the kernel .text segment. Its location will be communicated to the Xen hypervisor, which will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.112	20-Jan-2019	mlarkin	Implement rdmsr_safe rdmsr_safe is used when reading potentially missing MSRs, to avoid triggering #GPs in the kernel. ok guenther
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to use a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
Revision tags: OPENBSD_6_4_BASE
# 1.111	07-Oct-2018	guenther	In vmm, handle xsetbv like xrstor: instead of trying to prevalidate the values, just try it and handle the #GP if it faults. Problem reported by Maxime Villard (max(at)m00nbsd.net) ok mlarkin@
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly build bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Set up a hypercall page in the kernel .text segment. Its location will be communicated to the Xen hypervisor, which will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore. Information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere. Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.110	04-Oct-2018	guenther	Use PCIDs where they and the INVPCID instruction are available. This uses one PCID for kernel threads, one for the U+K tables of normal processes, one for the matching U-K tables (when meltdown in effect), and one for temporary mappings when poking other processes. Some further tweaks are envisioned but this is good enough to provide more separation and has (finally) been stable under ports testing. lots of ports testing and valid complaints from naddy@ and sthen@ feedback from mlarkin@ and sf@
# 1.109	12-Sep-2018	guenther	Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119), avoid some TLB flushes by not reloading %cr3 when the value isn't changing. original diff by and ok mlarkin@
# 1.108	09-Sep-2018	guenther	Calculate automatically the padding necessary for lining up the iretq instruction used when Meltdown mitigation is in effect. It got pushed off when an lfence was added in locore.S rev 1.107, resulting in two signals being sent instead of one when iretq faulted, and neither signal had the correct sigcontext info. Update the makefile rule for locore.o to verify that things are correct. ok mlarkin@
# 1.107	24-Jul-2018	guenther	Also do RSB refilling when context switching, after vmexits, and when vmlaunch or vmresume fails. Follow the lead of clang and the intel recommendation and do an lfence after the pause in the speculation-stop path for retpoline, RSB refill, and meltover ASM bits. ok kettenis@ deraadt@
# 1.106	23-Jul-2018	guenther	Do "Return stack refilling", based on the "Return stack underflow" discussion and its associated appendix at https://support.google.com/faqs/answer/7625886 This should address at least some cases of "SpectreRSB" and earlier Spectre variants; more commits to follow. The refilling is done in the enter-kernel-from-userspace and return-to-userspace-from-kernel paths, making sure to do it before unblocking interrupts so that a successive interrupt can't get the CPU to C code without doing this refill. Per the link above, it also does it immediately after mwait, apparently in case the low-power CPU states of idle-via-mwait flush the RSB. ok mlarkin@ deraadt@
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly build bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64. This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of his. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere. Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: in locore.S, %r9 is not saved over function calls, and moreover we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.105	12-Jul-2018	guenther	Reorganize the Meltdown entry and exit trampolines for syscall and traps so that the "mov %rax,%cr3" is followed by an infinite loop which is avoided because the mapping of the code being executed is changed. This means the sysretq/iretq isn't even present in that flow of instructions in the kernel mapping, so userspace code can't be speculatively reached on the kernel mapping and totally eliminates the conditional jump over the %cr3 change that supported CPUs without the Meltdown vulnerability. The return paths were probably vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively executing user code post-system-call with the kernel mappings, thus creating cache/TLB/etc side-effects. Would like to apply this technique to the interrupt stubs too, but I'm hitting a bug in clang's assembler which misaligns the code and symbols. While here, when on a CPU not vulnerable to Meltdown, codepatch out the unnecessary bits in cpu_switchto(). Inspiration from sf@, refined over dinner with theo ok mlarkin@ deraadt@
# 1.104	10-Jul-2018	deraadt	In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY macro. amd64 binaries now are free of double+-nop sequences (except for one assembler nit in aes-586.pl). Previous changes by guenther got us here. ok mortimer kettenis
# 1.103	03-Jul-2018	mortimer	Add retguard macros for kernel asm. ok deraadt, ok mlarkin (vmm_support)
# 1.102	01-Jul-2018	guenther	Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then use it where that was manually written before. No binary change. ok deraadt@
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
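The FS.base caching described in 1.45 can be pictured with a small user-space model. This is only a sketch under stated assumptions: the struct and function names are hypothetical stand-ins (only the field names pcb_fsbase and ci_cur_fsbase come from the commit message), and the real logic lives in assembly on the return-to-userspace path, where the two non-skip outcomes correspond to reloading %fs and executing wrmsr.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures named in the commit. */
struct sketch_pcb {
	uint64_t pcb_fsbase;      /* FS.base this thread wants */
};

struct sketch_cpu_info {
	uint64_t ci_cur_fsbase;   /* FS.base currently loaded on this CPU */
};

/* What the return path would have to do (illustrative labels only). */
enum fs_action {
	FS_SKIP,             /* CPU already holds the right value */
	FS_RELOAD_SELECTOR,  /* wanted base is zero: reloading %fs zeroes FS.base */
	FS_WRMSR             /* nonzero base: needs the slower wrmsr */
};

/* Decide how to restore FS.base and update the per-CPU cache. */
enum fs_action
sketch_restore_fsbase(struct sketch_cpu_info *ci, const struct sketch_pcb *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return FS_SKIP;
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
	return pcb->pcb_fsbase == 0 ? FS_RELOAD_SELECTOR : FS_WRMSR;
}
```

In this model a non-threaded process without TLS never triggers FS_WRMSR, which is the cheap case the commit message calls out.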
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n). There will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.101	14-Jun-2018	guenther	Clear the GPRs when entering the kernel from userspace so that user-controlled values can't take part in speculative execution in the kernel down paths that end up "not taken" but that may cause user-visible effects (cache, etc). prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe ok deraadt@ kettenis@
# 1.100	09-Jun-2018	guenther	Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps and intr_fast_exit clean ok mpi@
# 1.99	07-Jun-2018	guenther	Apply the retpoline transformation to indirect jumps in the raw ASM ok mlarkin@ mortimer@ deraadt@
# 1.98	05-Jun-2018	guenther	Switch from lazy FPU switching to semi-eager FPU switching: track whether curproc's xstate ("extended state") is loaded in the CPU or not. - context switch, sendsig(), vmm, and doing CPU crypto in the kernel all check the flag and, if set, save the old thread's state to the PCB, clear the flag, and then load the _blank_ state - when returning to userspace, if the flag is clear then set it and restore the thread's state This simpler tracking also fixes the restoring of FPU state after nested signal handlers. With this, %cr0's TS flag is never set, the FPU #DNA trap can no longer happen, and IPIs are no longer necessary for flushing or syncing FPU state; on the other hand, restoring xstate while returning to userspace means we have to handle xrstor faulting if we could be loading an altered state. If that happens, reset the state, fake a #GP fault (SIGBUS), and recheck for ASTs. While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by using codepatching to switch to xsave/xrstor when present in the CPU. In addition, code patch in use of xsaveopt in most places when the CPU supports that. Use the 64bit-wide variants of the instructions in all cases so that x87 instruction fault IPs are reported correctly. This change has three motivations: 1) with modern clang, SSE registers are used even in rcrt0.o, making lazy FPU switching a smaller benefit vs trap costs 2) the Intel SDM warns that lazy FPU switching may increase power costs 3) post-Spectre rumors suggest that the %cr0 TS flag might not block speculation, permitting leaking of information about FPU state (AES keys?) across protection boundaries. tested by many in snaps; prodding from deraadt@
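The flag-based tracking that 1.98 describes can be modeled in a few lines of C. This is a toy sketch, not the kernel's code: every name here is hypothetical, the "hardware" register file is just a struct, and the blank state is modeled as all zeroes. It only illustrates the two rules from the commit message: kernel-side users save and blank the state if the flag is set, and the return to userspace restores it if the flag is clear.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_XSTATE_WORDS 4	/* toy size, not the real xsave area */

struct sketch_xstate {
	unsigned long regs[SKETCH_XSTATE_WORDS];
};

struct sketch_cpu {
	bool xstate_loaded;       /* is curproc's xstate in the CPU? */
	struct sketch_xstate hw;  /* stands in for the FPU/SSE registers */
};

struct sketch_thread {
	struct sketch_xstate saved;  /* stands in for the PCB save area */
};

/*
 * Called where the commit lists it: context switch, sendsig(), vmm,
 * and kernel CPU crypto.  If the thread's state is live, save it,
 * clear the flag, and load the blank state.
 */
void
sketch_fpu_kernel_use(struct sketch_cpu *cpu, struct sketch_thread *old)
{
	if (cpu->xstate_loaded) {
		old->saved = cpu->hw;
		cpu->xstate_loaded = false;
		memset(&cpu->hw, 0, sizeof(cpu->hw));
	}
}

/* On return to userspace: if the flag is clear, set it and restore. */
void
sketch_fpu_return_to_user(struct sketch_cpu *cpu,
    const struct sketch_thread *cur)
{
	if (!cpu->xstate_loaded) {
		cpu->xstate_loaded = true;
		cpu->hw = cur->saved;
	}
}
```

Because the state is always either saved or loaded by the time userspace runs, nothing in this model depends on a trap firing on first FPU use, which matches the commit's note that the #DNA trap can no longer happen.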
# 1.97	05-Jun-2018	guenther	Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit. Move AST handling from the bottom of alltraps and Xdoreti to the top of the new routine. syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after the AST check (already performed for the former, skipped for the latter) Delete a couple debugging hooks mlarkin@ and I used during Meltdown work tested by many in snaps; thanks to brynet@ for spurious interrupt testing earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@
# 1.96	20-May-2018	guenther	Stash the syscall number in tf_err so it can be reported by the SPL check ok mlarkin@ mpi@
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	branches: 1.94.2; Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
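The cookie scheme in 1.78 can be sketched as a small user-space model. The names below are hypothetical stand-ins, not the kernel's actual structures, and the real sendsig() and sigreturn(2) do far more than this. The sketch covers only the three checks the commit message names: entry from the exact sigtramp PC, cookie verification against (per-process secret ^ &sigcontext), and clearing the cookie so the same sigcontext cannot be reused.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the sigcontext written to the user stack. */
struct sketch_sigcontext {
	uint64_t sc_cookie;
	/* saved registers would follow in the real structure */
};

/* Hypothetical per-process state. */
struct sketch_process {
	uint64_t sigcookie;    /* random per-process secret */
	uint64_t sigtramp_pc;  /* the one PC sigreturn may be entered from */
};

/* Signal delivery: bind the cookie to this sigcontext's address. */
void
sketch_sendsig(const struct sketch_process *pr, struct sketch_sigcontext *sc)
{
	sc->sc_cookie = pr->sigcookie ^ (uint64_t)(uintptr_t)sc;
}

/* Returns true only if the frame would be accepted. */
bool
sketch_sigreturn(const struct sketch_process *pr,
    struct sketch_sigcontext *sc, uint64_t entry_pc)
{
	if (entry_pc != pr->sigtramp_pc)
		return false;	/* not called from the sigtramp */
	if (sc->sc_cookie != (pr->sigcookie ^ (uint64_t)(uintptr_t)sc))
		return false;	/* forged or relocated sigcontext */
	sc->sc_cookie = 0;	/* clear to prevent sigcontext reuse */
	return true;
}
```

Mixing the address into the cookie means a valid sigcontext copied to another location fails verification, since the XOR no longer matches.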
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simply the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastruture. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrariwise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.95	26-Apr-2018	guenther	Prefer leaq+%rip-relative over movabsq xrstor_resume must not have profile prologue, so use NENTRY Don't use _C_LABEL() with some pure-ASM labels
Revision tags: OPENBSD_6_3_BASE
# 1.94	21-Feb-2018	guenther	Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove a unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	branches: 1.89.2; Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)
# 1.94	21-Feb-2018	guenther	Meltdown: implement user/kernel page table separation. On Intel CPUs which speculate past user/supervisor page permission checks, use a separate page table for userspace with only the minimum of kernel code and data required for the transitions to/from the kernel (still marked as supervisor-only, of course): - the IDT (RO) - three pages of kernel text in the .kutext section for interrupt, trap, and syscall trampoline code (RX) - one page of kernel data in the .kudata section for TLB flush IPIs (RW) - the lapic page (RW, uncachable) - per CPU: one page for the TSS+GDT (RO) and one page for trampoline stacks (RW) When a syscall, trap, or interrupt takes a CPU from userspace to kernel the trampoline code switches page tables, switches stacks to the thread's real kernel stack, then copies over the necessary bits from the trampoline stack. On return to userspace the opposite occurs: recreate the iretq frame on the trampoline stack, switch stack, switch page tables, and return to userspace. mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing issues on MP in particular, and drove the final push to completion. Many rounds of testing by naddy@, sthen@, and others Thanks to Alex Wilson from Joyent for early discussions about trampolines and their data requirements. Per-CPU page layout mostly inspired by DragonFlyBSD. ok mlarkin@ deraadt@
# 1.93	07-Jan-2018	mlarkin	remove all PG_G global page mappings from the kernel when running on Intel CPUs. Part of an ongoing set of commits to mitigate the Intel "meltdown" CVE. This diff does not confer any immunity to that vulnerability - subsequent commits are still needed and are being worked on presently. ok guenther, deraadt
# 1.92	06-Jan-2018	guenther	Handle %gs like %[def]s and reset it in cpu_switchto() instead of on every return to userspace. ok kettenis@ mlarkin@
# 1.91	10-Oct-2017	mlarkin	remove an unused variable ok tom, kettenis, deraadt
# 1.90	05-Oct-2017	mlarkin	Clean up some no longer needed includes left over from the locore/locore0 split. ok tom, mpi, deraadt
Revision tags: OPENBSD_6_2_BASE
# 1.89	04-Oct-2017	guenther	Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return from the trap to a 'resume' address to effectively make xrstor_user() return an error indication, then do the FPU cleanup and trap generation from there where we can get access to the original, userspace trapframe. The original fix tried to handle the trap while on the wrong trapframe, leaking kernel addresses and possibly leading to double faults. Problem pointed out by abluhm@ ok deraadt@ mikeb@
# 1.88	03-Oct-2017	guenther	The xrstor instruction will fault if the provided xstate data, which is under userspace control via sigreturn, fails various consistency checks. Rather than trying to replicate the CPU's hardwired checks in C code, handle it like iretq: check in trap() whether a fault is from the problem instruction and handle it there. CPU behavior and the potential issue pointed out on Linux kernel-hardening ok mikeb@ deraadt@
# 1.87	06-Jul-2017	deraadt	0xcc-fill a few more alignments. Not because these ones matter particularly, but because elimination highlights more important ones. Cursory review mortimer, ok mlarkin
# 1.86	29-Jun-2017	deraadt	Put asm-generated strings into .rodata ok millert
# 1.85	31-May-2017	deraadt	Split early startup code out of locore.S into locore0.S. Adjust link run so that this locore0.o is always at the start of the executable. But randomize the link order of all other .o files in the kernel, so that their exec/rodata/data/bss segments land all over the place. Late during kernel boot, unmap the early startup code. As a result, the internal layout of every newly built bsd kernel is different from past kernels. Internal relative offsets are not known to an outside attacker. The only known offsets are in the startup code, which has been unmapped. Ramdisk kernels cannot be compiled like this, because they are gzip'd. When the internal pointer references change, the compression dictionary bloats and results in poorer compression. ok kettenis mlarkin visa, also thanks to tedu for getting me back to this
Revision tags: OPENBSD_6_1_BASE
# 1.84	06-Feb-2017	mpi	branches: 1.84.4; Sync a comment with i386.
# 1.83	04-Sep-2016	mpi	Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel profiling framework. Code patching is used to enable probes when entering functions. The probes will call a mcount()-like function to match the behavior of a GPROF kernel. Currently only available on amd64 and guarded under DDBPROF. Support for other archs will follow soon. A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0 to be able to use this feature. Inputs and ok guenther@
Revision tags: OPENBSD_6_0_BASE
# 1.82	16-Jul-2016	mlarkin	branches: 1.82.2; remove some unused #includes
# 1.81	22-Jun-2016	mikeb	Setup Hyper-V hypercall page and an IDT vector. ok mlarkin, kettenis, deraadt
# 1.80	06-Jun-2016	deraadt	Fill a few more pads with 0xcc ok mikeb, mlarkin
# 1.79	23-May-2016	deraadt	Place a cpu-dependent trap/illegal instruction over the remainder of the sigtramp page, so that it will generate a nice kernel fault if touched. While here, move most of the sigtramps to the .rodata segment, because they are not executed in the kernel. Also some preparation for sliding the actual sigtramp forward (will need some gdb changes) ok mlarkin kettenis
# 1.78	10-May-2016	deraadt	SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie inside the sigcontext. sigreturn(2) checks syscall entry was from the exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie, and clears it to prevent sigcontext reuse. not yet tested on landisk, sparc, *88k, socppc. ok kettenis
# 1.77	10-May-2016	mikeb	Fill Xen hypercall page with int3's like the hypervisor does. Idea from deraadt@ and mlarkin@.
# 1.76	26-Feb-2016	mlarkin	SYMTAB_SPACE is no longer used (last used with a.out ddb)
Revision tags: OPENBSD_5_9_BASE
# 1.75	04-Jan-2016	mlarkin	wrap a long line
# 1.74	08-Dec-2015	mikeb	Setup a hypercall page in the kernel .text segment Its location will be communicated with the Xen hypervisor that will fill it in with instructions resulting in VMEXIT events. Discussed with kettenis@ and deraadt@, with input from and OK mpi, mlarkin, reyk
# 1.73	09-Nov-2015	mlarkin	Cache the result of cpuid leaf function $0x1 from the host's boot CPU during locore, information based on this will be returned to guest VMs issuing cpuid instructions later, under certain circumstances.
Revision tags: OPENBSD_5_8_BASE
# 1.72	17-Jul-2015	guenther	Consistently use SEL_RPL as the mask when testing selector privilege level
# 1.71	17-Jul-2015	mlarkin	"are we 386, 386sx, or 486, or Pentium, or.." I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so delete the (unused) variable that was supposed to track which 32 bit CPU we were running on.
# 1.70	16-Jul-2015	mlarkin	remove 'cpu_brand_id' as we no longer use that method to calculate the name of the cpu. Further, the calculation of cpu_brand_id was in the wrong place to begin with, so it was being calculated incorrectly anyway.
# 1.69	16-Jul-2015	mlarkin	Fix a backward compare in boot argument parsing, and clarify a comment that was wrong. ok guenther@
# 1.68	28-Jun-2015	guenther	Force the return to userspace from execve to go through iretq to get all registers. This lets us kill the special handling of pid 1 in fork and merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used to modify registers. ok mlarkin@ kettenis@
# 1.67	28-Jun-2015	guenther	Split AST handling from trap() into ast() and get rid of T_ASTFLT. Don't skip the AST check when returning from *fork() in the child. Make sure to count interrupts even when they're deferred or stray. testing by krw@, and then many via snapshots
# 1.66	23-Jun-2015	bluhm	If the kernel symbols fit completely into the 2 MB alignment hole after kernel bss but before end of the image, the page tables used the read-only mapping of the hole. When booting a small non-generic kernel, this resulted in a crash, while writing to the page tables later. Make sure that the page tables are created after esym and after end. OK mlarkin@ deraadt@
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: reseting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
# 1.65	18-May-2015	guenther	Do lazy update/reset of the FS.base and %[def]s segment registers: resetting segment registers in cpu_switchto if the old thread had made it to userspace and restoring FS.base only on first return to userspace since context switch. ok mlarkin@
# 1.64	18-Apr-2015	guenther	i386 and amd64 have only one syscall entry point now, so simplify the EIP/RIP adjustment for ERESTART ok mlarkin@
# 1.63	22-Mar-2015	guenther	Explain the state on syscall entry
Revision tags: OPENBSD_5_7_BASE
# 1.62	16-Jan-2015	sf	Binary code patching on amd64 This commit adds generic infrastructure to do binary code patching on amd64. The existing code patching for SMAP is converted to the new infrastructure. More consumers and support for i386 will follow later. This version of the diff has some simplifications in codepatch_fill_nop() compared to a version that was: OK @kettenis @mlarkin @jsg
# 1.61	21-Dec-2014	mlarkin	Prevent writing to the kernel area via the direct map. We do this by padding the end of the kernel area to 2MB, so that the direct map pages can then have the W permission removed (X permission was already removed in a previous diff). This creates a VA hole at the end of bss, so adjust for that since that's where symbols get loaded by the bootloader (for now, map that region RO until the boot loader can be updated to place the symbols at "end" instead of "end of bss"). with help from and ok deraadt@
# 1.60	27-Nov-2014	mlarkin	Missing comparison caused NX to always be enabled during boot, even on CPUs that may have had it disabled in BIOS. ok deraadt@
# 1.59	20-Nov-2014	mlarkin	When removing the identity mapping in low memory used during bootstrap, there is no reason to keep the NX bit around on null PTEs (PTEs that have been removed).
# 1.58	20-Nov-2014	mlarkin	Move previous PTE permission fixup code into locore, and fixup some more ranges while we're there. ok deraadt@, tested by many and in snaps
# 1.57	07-Nov-2014	mlarkin	Wrong comment - NX is handled later (for now), not in locore. No functional change. noticed by deraadt@
# 1.56	05-Nov-2014	mlarkin	Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt. ok deraadt@
# 1.55	09-Oct-2014	tedu	no need for lkm_map now
Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54	10-Nov-2012	mglocker	Recent x86 CPUs come with a constant time stamp counter. If this is the case we verify if the CPU supports a specific version of the architectural performance monitoring feature and read out the current frequency from the fixed-function performance counter of the unhalted core. My initial motivation to implement this was the Soekris net6501-70 which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant time stamp counter plus speed step support and boots on the lowest frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to reflect the wrong values. The diff is a cooperation work with jsg@. The fixed-function performance counter read code comes from a former diff of him. OK jsg@
# 1.53	25-Sep-2012	pirofti	Remove unused acpi locking code. To be replaced with higher level C routines once we settle for a common consistent set of atomic operations across platforms. Discussed with and okay by deraadt@ and kettenis@.
Revision tags: OPENBSD_5_2_BASE
# 1.52	06-May-2012	guenther	Garbage collect the old int$80 kernel entry point: the last use of it by the not-normally-used sigreturn() stub in libc was changed to use 'syscall' instruction in 5.0 ok mikeb@ jsg@
Revision tags: OPENBSD_5_1_BASE
# 1.51	26-Dec-2011	haesbaert	Add the missing ECX cpu flags from CPUID at 0x80000001. This is all documented at: http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20) http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41) ok jsg@
# 1.50	12-Oct-2011	miod	Remove all MD diagnostics in cpu_switchto(), and move them to MI code if they apply. ok oga@ deraadt@
# 1.49	03-Sep-2011	guenther	Add a general warning about gdb matching against sigcode instructions
Revision tags: OPENBSD_5_0_BASE
# 1.48	04-Jul-2011	guenther	Force the sigreturn syscall to return to userspace via iretq by setting the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel via syscall instead of int$80. Rearrange the return paths in both the sysretq and iretq paths to reduce how long interrupts are blocked and shave instructions. ok kettenis@, extra testing krw@
# 1.47	13-Apr-2011	guenther	Unrevert the FS.base diff: the issues were actually elsewhere Additional testing by jasper@ and pea@
# 1.46	10-Apr-2011	guenther	Revert bulk of the FS.base diff, as it causes issues on some machines and the problem isn't obvious yet.
# 1.45	05-Apr-2011	guenther	Add support for per-rthread base-offset for the %fs selector on amd64. Add pcb_fsbase to the PCB for tracking what the value for the thread is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current value for FS.base, then on return to user-space, skip the setting if the CPU has the right value already. Non-threaded processes without TLS leave FS.base zero, which can be conveniently optimized: setting %fs zeros FS.base for fewer cycles than wrmsr. ok kettenis@
Revision tags: OPENBSD_4_9_BASE
# 1.44	04-Dec-2010	guenther	The pm_cpus member of the pmap is now a 64bit integer: update the assembly used in cpu_switch() for handling it. Also, delete an unnecessary instruction that I added while debugging the pm_cpus handling before ok kettenis@
# 1.43	13-Nov-2010	guenther	Switch from TSS-per-process to TSS-per-CPU, placing the TSS right next to the cpu's GDT, also making the double-fault stack per-CPU, leaving it at the top of the page of the CPU's idle process. Inline pmap_activate() and pmap_deactivate() into the asm cpu_switchto routine, adding a check for the new pmap already being marked as active on the CPU. Garbage collect the hasn't-been-used-in-years GDT update IPI. Tested by many; ok mikeb@, kettenis@
# 1.42	26-Oct-2010	guenther	The LDT is only used by dead compat code now, so load the ldt register with the null selector (disabling use of it), stop reloading it on every context switch, and blow away the table itself, as well as the pcb and pmap bits that were used to track it. Also, delete two other unused pcb members: pcb_usersp and pcb_flags. (Deleting pcb_usersp also keeps the pcb_savefpu member aligned properly.) Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT sysarch() calls. Tested by various with both AMD and Intel chips ok mikeb@
# 1.41	14-Oct-2010	guenther	Clean up segment handling: switch user-space to using code and data segments in the GDT instead of the LDT and eliminate the GDT slots that we don't actually use. tested on both amd and intel by several not really the right person, but ok: kettenis@
# 1.40	28-Sep-2010	guenther	Correct the handling of GS.base when iretq faults: the fault happens with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling won't work. Contrawise, the asm that trap() redirects us to when that happens (resume_iret) sees a trapframe showing CPL==3 but it's run with the kernel's GS.base, so INTRENTRY won't work there either. asm style fixes drahn@ and mikeb@ ok kettenis@
Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39	09-Jun-2009	krw	revert guenther@'s un-revert of art's curpmap. My bios0: ASUSTeK Computer INC. P5K-E cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz can't boot with this in. It always hangs somewhere in fsck'ing if any, or between netstart and local daemons if no fsck'ing. Also fubars theo's real amd machine. Much more testing needed for this.
# 1.38	06-Jun-2009	guenther	Unrevert the curpmap change with the addition of correct %gs handling in the IPI handler so that it works when it interrupts userspace, waiting for the droppmap IPI to complete when destroying it, and (most importantly) don't call pmap_tlb_droppmap() from cpu_exit(). Tested by myself and ckuethe, as our machines choked on the original. ok @art
# 1.37	05-Jun-2009	guenther	Revert the curpmap change. We know the IPI is broken on both ends, but even with proposed fixes, the reaper panics are back.
# 1.36	02-Jun-2009	jordan	Added interface for cpu idle on amd64 ok gwk@, toby@, marco@
# 1.35	28-May-2009	art	Bring back the curpmap change. It was missing a reload of the pmap on curcpu when we were freeing a pmap. Tested and working for a few weeks now, but I was a bit too busy to commit it earlier.
# 1.34	27-Apr-2009	deraadt	turning pmap_deactivate into a NOP brought back the reaper panics, probably because the reaper is running on the mappings of pmap from the process it is about to unmap. back it out until it is fixed right; don't let this sit in the tree waiting for a fix.
# 1.33	23-Apr-2009	art	Make pmap_deactivate a NOP. Instead of keeping a bitmask of on which cpu the pmap might be active which we clear in pmap_deactivate, always keep a pointer to the currently loaded pmap in cpu_info. We can now optimize a context switch to the kernel pmap (idle and kernel threads) to keep the previously loaded pmap still loaded and then reuse that pmap if we context switch back to the same process. Introduce a new IPI to force a pmap reload before the pmap is destroyed. Clean up cpu_switchto. toby@ ok
# 1.32	31-Mar-2009	art	- remove obsolete comment - remove dead (#if 0) code - move switch_error panics to after cpu_switchto to make branch prediction happier and the code more readable. no functional change
Revision tags: OPENBSD_4_5_BASE
# 1.31	15-Feb-2009	mikeb	Set the limit of the GDT table to its size - 1. Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks! Checked with kettenis@. ok kettenis
# 1.30	12-Nov-2008	weingart	Add a comment to sigcode() to explain why the use of 'int $0x80' is necessary, so that future hackers will not be misled the same way I was when looking at this code.
# 1.29	24-Oct-2008	deraadt	remove unused label
# 1.28	13-Aug-2008	weingart	This tab had bugged me forever.
Revision tags: OPENBSD_4_4_BASE
# 1.27	28-Jul-2008	miod	No longer clear ci_want_resched within cpu_switchto(), now that it's done in the MI code.
# 1.26	27-Jun-2008	ray	More removal of clauses 3 and 4 from NetBSD licenses. OK deraadt@ and millert@
Revision tags: OPENBSD_4_3_BASE
# 1.25	03-Nov-2007	gwk	Add acpi_acquire_global_lock(), and acpi_release_global_lock to amd64 the not ghetto architecture. ok toby@
# 1.24	10-Oct-2007	art	Make context switching much more MI: - Move the functionality of choosing a process from cpu_switch into a much simpler function: cpu_switchto. Instead of having the locore code walk the run queues, let the MI code choose the process we want to run and only implement the context switching itself in MD code. - Let MD context switching run without worrying about spls or locks. - Instead of having the idle loop implemented with special contexts in MD code, implement one idle proc for each cpu. make the idle loop MI with MD hooks. - Change the proc lists from the old style vax queues to TAILQs. - Change the sleep queue from vax queues to TAILQs. This makes wakeup() go from O(n^2) to O(n) there will be some MD fallout, but it will be fixed shortly. There's also a few cleanups to be done after this. deraadt@, kettenis@ ok
# 1.23	12-Sep-2007	deraadt	port of i386 pctr code to amd64; Mike Belopuhov
Revision tags: OPENBSD_4_2_BASE
# 1.22	27-May-2007	art	- Redo the way we set up the direct map. Map the first 4GB of it in locore so that we can use the direct map in pmap_bootstrap when setting up the initial page tables. - Introduce a second direct map (I love large address spaces) with uncached pages. jason@ ok
Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21	20-Aug-2005	jsg	Check for and report the presence of SSE3. This has started to appear in AMD products with the arrival of the venice core. ok deraadt@
# 1.20	26-Jul-2005	art	Instead of juggling around with cr4 and enabling parts of it sometimes, other parts later, etc. Just set it to the same default value everywhere. We won't survive without PSE and it's not like someone will suddenly make an amd64 that doesn't support PGE. This will allow us to make the bootstrap process slightly more sane.
# 1.19	29-May-2005	deraadt	sched work by niklas and art backed out; causes panics
# 1.18	27-May-2005	art	Stop pretending that amd64 is i386. We're insulting the cpu by not even pretending to use all the address space it gives us. - Map all physical memory 1-1 and implement PMAP_DIRECT - Remove the vast magic we do to map pages for pmap_zero_page, pmap_copy_page, pv allocation, magic while bootstrapping, reading of /dev/mem, etc. - implement a fast pmap_zero_page based on sse instructions. I love removing code. More to come. deraadt@ ok tested by many.
# 1.17	25-May-2005	niklas	This patch is mostly art's work and was done a year ago. Art wants to thank everyone for the prompt review and ok of this work ;-) Yeah, that includes me too, or maybe especially me. I am sorry. Change the sched_lock to a mutex. This fixes, among other things, the infamous "telnet localhost &" problem. The real bug in that case was that the sched_lock which is by design a non-recursive lock, was recursively acquired, and not enough releases made us hold the lock in the idle loop, blocking scheduling on the other processors. Some of the other processors would hold the biglock though, which made it impossible for cpu 0 to enter the kernel... A nice deadlock. Let me just say debugging this for days just to realize that it was all fixed in an old diff no one ever ok'd was somewhat of an anti-climax. This diff also changes splsched to be correct for all our architectures.
Revision tags: OPENBSD_3_7_BASE
# 1.16	06-Jan-2005	martin	missing $OpenBSD$
# 1.15	01-Jan-2005	millert	gcc 3.3.5 will store zero-initialized variables in bss by default, move bootdev to data so it doesn't get zapped when bss is cleared. deraadt@ OK
Revision tags: OPENBSD_3_6_BASE
# 1.14	25-Jun-2004	art	SMP support. Big parts from NetBSD, but with some really serious debugging done by me, niklas and others. Especially wrt. NXE support. Still needs some polishing, especially in dmesg messages, but we're now building kernel faster than ever.
# 1.13	22-Jun-2004	art	Switch amd64 to __HAVE_CPUINFO deraadt@ ok
# 1.12	21-Jun-2004	niklas	Pure luck has protected us from this bug until now: locore.S %r9 are not saved over function calls and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.
# 1.11	13-Jun-2004	niklas	debranch SMP, have fun
Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10	13-May-2004	sturm	activate systrace on amd64, while here get rid of syscall_{plain,fancy} instead use syscall() as everywhere else ok mickey, tested and ok tedu@
Revision tags: OPENBSD_3_5_BASE
# 1.9	25-Feb-2004	deraadt	dkcsum stuff for amd64, written by tom, who cannot commit it at the moment. now the amd64 knows what drive it was booted from.
# 1.8	23-Feb-2004	mickey	the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems
# 1.7	23-Feb-2004	mickey	get use of NX; partially from netbsd; passes the regress; deraadt@ ok
# 1.6	23-Feb-2004	tom	- Pick up the /boot argc, argv in locore.S (though not currently used) - Probe for console devices (incl serial) in /boot - Pass console device from /boot to kernel (temp via additional param) With this, boot> set tty com0 now works. "just don't break a build" deraadt@
# 1.5	22-Feb-2004	tom	- Make comment about parameters passed by /boot reflect reality - Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC() does this itself ok mickey@
# 1.4	20-Feb-2004	deraadt	use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed. we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl
# 1.3	07-Feb-2004	miod	branches: 1.3.2; Be sure to flag pte constants as UL, and cope with this in locore. ok deraadt@
# 1.2	03-Feb-2004	mickey	das boot; das cloned das from das i386
# 1.1	28-Jan-2004	mickey	an amd64 arch support. hacked by art@ from netbsd sources and then later debugged by me into the shape where it can host itself. no bootloader yet as needs redoing from the recent advanced i386 sources (anyone? ;)