History log of /openbsd-current/sys/arch/amd64/amd64/locore.S
Revision Date Author Comments
# 1.147 17-Mar-2024 guenther

Use VERW to mitigate the RFDS (Register File Data Sampling) vulnerability
present in Intel Atom CPUs, reordering some ASM in return-to-userspace and
start/resume-vmx-guest to reduce the number of kernel values still live in
registers when VERW is used. This mitigation requires updated firmware, with
which affected CPUs report RFDS_CLEAR in dmesg.

Firmware packaging by jsg@ and sthen@
Logic for interpreting Intel's flags by jsg@ after lots of discussion
between him, deraadt@, and me
ok deraadt@


Revision tags: OPENBSD_7_5_BASE
# 1.146 25-Feb-2024 guenther

We don't do compat32 so MSR_CSTAR shouldn't be set up: delete the
Xsyscall32 stub and UCODE32 selector, set MSR_CSTAR to zero at CPU
startup, and rezero on ACPI resume and VM exit.

requested a while ago by deraadt@
AMD VM testing chris@
testing and ok krw@


# 1.145 12-Feb-2024 guenther

Retpolines are an anti-pattern for IBT, so we need to shift protecting
userspace from cross-process BTI to the kernel. Have each CPU track
the last pmap run on in userspace and the last vmm VCPU in guest-mode
and use the IBPB msr to flush predictors right before running in
userspace on a different pmap or entering guest-mode on a different
VCPU. Codepatch-nop the userspace bits and conditionalize the vmm
bits to keep working if IBPB isn't supported.

ok deraadt@ kettenis@
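
The per-CPU tracking described in this entry can be sketched in user-space C. This is a minimal model, not the kernel code: `struct cpu_state`, `last_user_pmap`, `ibpb_count`, and `enter_userspace()` are hypothetical names, and the counter stands in for the IBPB MSR write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's real types. */
struct pmap { int id; };

struct cpu_state {
	const struct pmap *last_user_pmap;	/* last pmap run in userspace here */
	unsigned long ibpb_count;		/* barriers issued so far */
};

/*
 * Call right before returning to userspace on pmap `pm`.
 * Returns true if a predictor flush was needed.
 */
bool
enter_userspace(struct cpu_state *ci, const struct pmap *pm)
{
	assert(ci != NULL && pm != NULL);
	if (ci->last_user_pmap == pm)
		return false;	/* same address space: nothing to flush */
	ci->ibpb_count++;	/* the kernel would write the IBPB MSR here */
	ci->last_user_pmap = pm;
	return true;
}
```

Two threads of one process share a pmap, so switching between them costs no barrier in this model; only a change of address space does.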


# 1.144 12-Dec-2023 deraadt

remove support for syscall(2) -- the "indirection system call" because
it is a dangerous alternative entry point for all system calls, and thus
incompatible with the precision system call entry point scheme we are
heading towards. This has been a 3-year mission:
First perl needed a code-generated wrapper to fake syscall(2) as a giant
switch table, then all the ports were cleaned with relatively minor fixes,
except for "go". "go" required two fixes -- 1) a framework issue with
old library versions, and 2) like perl, a fake syscall(2) wrapper to
handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over
the place in the "go" ecosystem because the "go developers" are plan9-loving
unix-hating folk who tried to build an ecosystem without allowing "ioctl".
ok kettenis, jsing, afresh1, sthen


# 1.143 12-Dec-2023 deraadt

The sigtramp was calling sigreturn(2), and upon failure exit(2), which
doesn't make sense anymore. It is better to just issue an illegal
instruction.
ok kettenis, with some misgivings about inconsistent approaches between
architectures.
In the future we could change sigreturn(2) to never return an exit code,
but always just terminate the process. We stopped this system call
from being callable ages ago with msyscall(2), and there is no stub for
it in libc.. maybe that's the next step to take?


# 1.142 10-Dec-2023 deraadt

Add a new label "sigcodecall" inside every sigtramp definition, directly
in front of the syscall instruction. This is used to calculate the start
of the syscall for SYS_sigreturn and pinned system calls.
ok kettenis


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code, change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc, factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@
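
The decision logic in the bullet list above is small enough to state as code. This is only a sketch under assumed names (`struct cpu_features`, `choose_thunk()`, `should_enable_eibrs()`); the kernel does this by codepatching at boot, not with a runtime branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature flags; the kernel derives these from the CPU. */
struct cpu_features {
	bool has_eibrs;
	bool has_ibt;
};

enum thunk_kind { THUNK_RETPOLINE, THUNK_INDIRECT_JMP };

/* Retpoline stays the default; eIBRS *or* IBT switches to a plain jump. */
enum thunk_kind
choose_thunk(const struct cpu_features *f)
{
	assert(f != NULL);
	if (f->has_eibrs || f->has_ibt)
		return THUNK_INDIRECT_JMP;
	return THUNK_RETPOLINE;
}

/* eIBRS is turned on whenever the CPU offers it. */
bool
should_enable_eibrs(const struct cpu_features *f)
{
	assert(f != NULL);
	return f->has_eibrs;
}
```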


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@
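
The pattern here is a function pointer that is initialized at definition, so the caller never needs a NULL check. A minimal user-space sketch follows; `cpu_idle_cycle_fcn` is assumed to be the pointer's name, and the two idle routines are hypothetical stand-ins that only count calls.

```c
#include <assert.h>
#include <stddef.h>

int default_runs;	/* times the default idle path ran */
int other_runs;		/* times the replacement idle path ran */

/* Default path: the kernel would enable interrupts and halt here. */
void
idle_default(void)
{
	default_runs++;
}

/* Stand-in for a driver-provided idle routine. */
void
idle_other(void)
{
	other_runs++;
}

/* Initialized at definition, so it always points at something callable. */
void (*cpu_idle_cycle_fcn)(void) = idle_default;

void
cpu_idle_cycle(void)
{
	assert(cpu_idle_cycle_fcn != NULL);
	cpu_idle_cycle_fcn();
}
```

A driver that knows a better idle method would just assign its own routine to the pointer.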


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc
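
For readers unfamiliar with the PKU register layout: it holds two bits per protection key, access-disable at bit 2k and write-disable at bit 2k+1. The sketch below computes the access-disable bit for key 1 and models the "was it changed?" check. The function names are hypothetical, and the value the kernel really loads may set more bits than this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XO_KEY	1	/* the entry says PG_XO selects protection key 1 */

/* Two bits per key: bit 2k is access-disable, bit 2k+1 is write-disable. */
uint32_t
pkru_access_disable(unsigned key)
{
	assert(key < 16);
	return UINT32_C(1) << (2 * key);
}

/* Value forced on every return to userland in this sketch. */
uint32_t
pkru_user_value(void)
{
	return pkru_access_disable(XO_KEY);
}

/* On kernel entry: has userland changed the register? */
bool
pkru_was_changed(uint32_t seen)
{
	return seen != pkru_user_value();
}
```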


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@
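
The recovery pattern can be shown in user space with standard setjmp/longjmp. This is only an analogy: `onfault` stands in for the pcb_onfault hook, `fake_page_fault()` for the trap handler, and the two `firmware_*` functions for runtime service calls; none of these names come from the kernel.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

jmp_buf *onfault;	/* stand-in for the pcb_onfault hook */

/* Stand-in for the page fault handler: unwind to the recovery point. */
void
fake_page_fault(void)
{
	assert(onfault != NULL);	/* with no hook, a kernel would panic */
	longjmp(*onfault, 1);
}

void
firmware_ok(void)
{
}

void
firmware_buggy(void)
{
	fake_page_fault();	/* touches memory it never told us to map */
}

/* Returns 0 if the call completed, -1 if it faulted. */
int
guarded_call(void (*fn)(void))
{
	jmp_buf env;

	onfault = &env;
	if (setjmp(env) != 0) {
		onfault = NULL;	/* we got here via longjmp */
		return -1;
	}
	fn();
	onfault = NULL;
	return 0;
}
```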


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This keeps the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 November 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@
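
The alignment arithmetic can be checked with a few lines of C. The sketch assumes the usual encodings (ret is one byte, lfence is three) and a hypothetical helper `ret_padding()`; the real retguard macro does this in the assembler.

```c
#include <assert.h>
#include <stddef.h>

#define RET_LEN		1	/* ret is the single byte 0xc3 */
#define LFENCE_LEN	3	/* lfence is 0x0f 0xae 0xe8 */

/*
 * Padding bytes to place before a `ret` that would otherwise sit at
 * offset `off`, so that the byte following the lfence is 16-byte aligned.
 */
size_t
ret_padding(size_t off)
{
	size_t end = off + RET_LEN + LFENCE_LEN;
	size_t pad = (16 - end % 16) % 16;

	assert((end + pad) % 16 == 0);
	return pad;
}
```

With this placement the ret always sits at offset 12 modulo 16, so it can never be the last byte before a 32-byte boundary.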


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
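
The cr3 comparison fix amounts to masking one bit before comparing. In the sketch below, bit 63 is the architectural "keep this PCID's TLB entries" hint on a %cr3 write; `same_cr3()` and `switch_cr3()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 63 of a value written to %cr3 is a no-flush hint, not part of the address. */
#define CR3_REUSE_PCID	(UINT64_C(1) << 63)

/* Same page table and PCID, regardless of the reuse hint? */
bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID) == 0;
}

/* Model of the switch: returns true only if %cr3 had to be reloaded. */
bool
switch_cr3(uint64_t *cr3, uint64_t next)
{
	assert(cr3 != NULL);
	if (same_cr3(*cr3, next))
		return false;
	*cr3 = next;
	return true;
}
```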


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@
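
The 0x48 byte mentioned here is the REX.W prefix, which selects the 64-bit form of the instruction. As an illustration, the hypothetical helper below emits the machine code for `fxsave (%rdi)` and `fxsave64 (%rdi)`; the only difference is that leading byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REX_W	0x48	/* the prefix that used to be inserted by hand */

/*
 * Write the encoding of fxsave (%rdi), or fxsave64 (%rdi) when `wide`
 * is nonzero, into buf. Returns the number of bytes written.
 */
size_t
emit_fxsave_rdi(uint8_t *buf, size_t cap, int wide)
{
	size_t n = 0;

	assert(buf != NULL && cap >= 4);
	if (wide)
		buf[n++] = REX_W;
	buf[n++] = 0x0f;	/* two-byte opcode 0f ae */
	buf[n++] = 0xae;
	buf[n++] = 0x07;	/* ModRM: /0 selects fxsave, base register %rdi */
	return n;
}
```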


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
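
With PCIDs enabled, the low 12 bits of %cr3 carry the PCID and the rest holds the page-table address. The sketch below composes such a value for the four uses listed in this entry. The enum names and their numeric values are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_MASK	UINT64_C(0xfff)	/* CR3 bits 0-11 hold the PCID */

/* Illustrative names and values, one per use listed in the entry. */
enum pcid_sketch {
	PCID_KTHREADS = 0,	/* kernel threads */
	PCID_USER_UPLUSK = 1,	/* U+K tables of normal processes */
	PCID_USER_UMINUSK = 2,	/* matching U-K tables */
	PCID_TEMPMAP = 3	/* temporary mappings of other processes */
};

/* Combine a page-aligned top-level table address with a PCID. */
uint64_t
make_cr3(uint64_t pml4_pa, enum pcid_sketch id)
{
	assert((pml4_pa & CR3_PCID_MASK) == 0);
	return pml4_pa | (uint64_t)id;
}

/* Extract the PCID from a %cr3 value. */
unsigned
cr3_pcid(uint64_t cr3)
{
	return (unsigned)(cr3 & CR3_PCID_MASK);
}
```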


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
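
The flag tracking described at the top of this entry is a two-transition state machine. The user-space model below uses hypothetical types (`struct xstate`, `struct thread`, `struct cpu`) and plain struct copies in place of xsave/xrstor; it shows only the bookkeeping, not fault handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct xstate { unsigned char regs[16]; };	/* stand-in for FPU contents */
struct thread { struct xstate saved; };		/* the PCB copy */
struct cpu {
	struct xstate live;	/* what the FPU registers hold right now */
	bool user_xstate;	/* is curproc's xstate loaded? */
};

static const struct xstate blank_state = { { 0 } };

/* Context switch, sendsig(), vmm, or kernel crypto about to run. */
void
fpu_kernel_enter(struct cpu *ci, struct thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (ci->user_xstate) {
		cur->saved = ci->live;		/* save to the PCB */
		ci->user_xstate = false;
		ci->live = blank_state;		/* load the blank state */
	}
}

/* Last step before returning to userspace as thread `cur`. */
void
fpu_user_return(struct cpu *ci, const struct thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (!ci->user_xstate) {
		ci->user_xstate = true;
		ci->live = cur->saved;		/* restore the thread's state */
	}
}
```

Because the kernel side always leaves the blank state loaded, nothing of the previous thread's registers remains in the CPU while another thread or the kernel runs.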


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis
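
Filling the unused part of a page with trap bytes looks roughly like this. The sketch assumes a 4096-byte page and 0xcc (int3 on x86) as the fill byte, which matches the later "0xcc" entries in this log but is an assumption for this one; `fill_sigtramp_page()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096
#define TRAP_FILL		0xcc	/* int3 on x86; assumed fill byte */

/* Copy the trampoline to the start of the page and trap-fill the rest. */
void
fill_sigtramp_page(unsigned char *page, const unsigned char *code,
    size_t codelen)
{
	assert(page != NULL && code != NULL);
	assert(codelen <= SKETCH_PAGE_SIZE);
	memcpy(page, code, codelen);
	memset(page + codelen, TRAP_FILL, SKETCH_PAGE_SIZE - codelen);
}
```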


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
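
The cookie scheme can be modelled in a few lines. This is a sketch with hypothetical names (`struct sigcontext_sketch`, `sendsig_store()`, `sigreturn_check()`); the real sigcontext holds far more, and the PC check in the kernel compares against the per-process sigtramp address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sigcontext_sketch {
	uintptr_t cookie;
	/* saved registers would follow */
};

/* sendsig(): bind the context to this process and this address. */
void
sendsig_store(struct sigcontext_sketch *sc, uintptr_t proc_cookie)
{
	assert(sc != NULL);
	sc->cookie = proc_cookie ^ (uintptr_t)sc;
}

/* sigreturn(2): accept only from the sigtramp PC, only once. */
bool
sigreturn_check(struct sigcontext_sketch *sc, uintptr_t proc_cookie,
    uintptr_t pc, uintptr_t sigtramp_pc)
{
	assert(sc != NULL);
	if (pc != sigtramp_pc)
		return false;	/* entered from somewhere else */
	if (sc->cookie != (proc_cookie ^ (uintptr_t)sc))
		return false;	/* forged, moved, or already used */
	sc->cookie = 0;		/* prevent reuse of this context */
	return true;
}
```

Mixing the context's own address into the cookie means a copy of a valid sigcontext placed elsewhere on the stack fails the check.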


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
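
The padding arithmetic described in 1.61 can be sketched in C. This is only
an illustration, not the locore.S code: the names sketch_round_2mb and
sketch_hole_size are hypothetical, and the only thing taken from the log
message is that the end of the kernel area is padded up to a 2MB boundary,
which leaves a hole after the end of bss.

```c
#include <stdint.h>

/* 2MB, the large-page size the kernel end is padded to (per the log). */
#define SKETCH_2MB	(2ULL * 1024 * 1024)

/* Round addr up to the next 2MB boundary; aligned values are unchanged. */
static uint64_t
sketch_round_2mb(uint64_t addr)
{
	return (addr + SKETCH_2MB - 1) & ~(SKETCH_2MB - 1);
}

/* Size of the VA hole between the end of bss and the padded kernel end. */
static uint64_t
sketch_hole_size(uint64_t end_of_bss)
{
	return sketch_round_2mb(end_of_bss) - end_of_bss;
}
```

Whether the symbols fit into that hole is what decides the case fixed later
in 1.66.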


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
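
The "skip the setting if the CPU has the right value already" idea in 1.45
can be sketched in C as follows. This is a simplified model under stated
assumptions, not the kernel's assembly: struct sketch_cpu and
sketch_return_to_user are hypothetical, the wrmsr is replaced by a counter,
and the cheaper "load %fs to zero FS.base" path is not modelled. Only the
field names ci_cur_fsbase and pcb_fsbase come from the log message.

```c
#include <stdint.h>

/* Hypothetical per-CPU state: the FS.base value the CPU currently holds. */
struct sketch_cpu {
	uint64_t ci_cur_fsbase;
	unsigned int wrmsr_count;	/* stands in for the cost of wrmsr */
};

/*
 * On return to userspace, write FS.base only when the thread's saved
 * value differs from what the CPU already has loaded.
 */
static void
sketch_return_to_user(struct sketch_cpu *ci, uint64_t pcb_fsbase)
{
	if (ci->ci_cur_fsbase == pcb_fsbase)
		return;			/* already correct: skip the write */
	ci->wrmsr_count++;		/* the real code would do wrmsr here */
	ci->ci_cur_fsbase = pcb_fsbase;
}
```

Repeated returns to the same thread therefore cost one write at most, which
is the point of tracking the value per CPU.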


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
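
Why the limit is "size - 1" can be shown with a small C sketch. The x86
descriptor-table limit is the offset of the last valid byte, not the byte
count, so a table of N bytes has limit N - 1. The structure and function
names below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lgdt operand: 16-bit limit, 64-bit base. */
struct sketch_region_descriptor {
	uint16_t rd_limit;
	uint64_t rd_base;
} __attribute__((packed));

/* Fill in the descriptor for a table occupying table_size bytes. */
static void
sketch_set_region(struct sketch_region_descriptor *rd, const void *table,
    size_t table_size)
{
	rd->rd_limit = (uint16_t)(table_size - 1);	/* last valid offset */
	rd->rd_base = (uint64_t)(uintptr_t)table;
}
```

For a table of eight 8-byte descriptors this yields a limit of 63.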


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.144 12-Dec-2023 deraadt

remove support for syscall(2) -- the "indirection system call" because
it is a dangerous alternative entry point for all system calls, and thus
incompatible with the precision system call entry point scheme we are
heading towards. This has been a 3-year mission:
First perl needed a code-generated wrapper to fake syscall(2) as a giant
switch table, then all the ports were cleaned with relatively minor fixes,
except for "go". "go" required two fixes -- 1) a framework issue with
old library versions, and 2) like perl, a fake syscall(2) wrapper to
handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over
the place in the "go" ecosystem because the "go developers" are plan9-loving
unix-hating folk who tried to build an ecosystem without allowing "ioctl".
ok kettenis, jsing, afresh1, sthen


# 1.143 12-Dec-2023 deraadt

The sigtramp was calling sigreturn(2), and upon failure exit(2), which
doesn't make sense anymore. It is better to just issue an illegal
instruction.
ok kettenis, with some misgivings about inconsistent approaches between
architectures.
In the future we could change sigreturn(2) to never return an exit code,
but always just terminate the process. We stopped this system call
from being callable ages ago with msyscall(2), and there is no stub for
it in libc.. maybe that's the next step to take?


# 1.142 10-Dec-2023 deraadt

Add a new label "sigcodecall" inside every sigtramp definition, directly
in front of the syscall instruction. This is used to calculate the start
of the syscall for SYS_sigreturn and pinned system calls.
ok kettenis


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Cleanup these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should prevent the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with a endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents that the ret
instruction is at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
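
Both halves of 1.125 can be sketched in C. This is a simplified model, not
the kernel code: the structures and function names are hypothetical, and the
real pmap sends IPIs rather than returning a bitmask. The assumptions are
that each CPU records the pmap it has loaded (ci_proc_pmap, as named in the
log) and that the reuse flag is bit 63 of the value written to %cr3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the "don't flush this PCID" flag: bit 63. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

struct sketch_pmap { int pm_dummy; };
struct sketch_cpu { const struct sketch_pmap *ci_proc_pmap; };

/* "Am I loading the same cr3?" test that ignores the reuse flag. */
static bool
sketch_same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~SKETCH_CR3_REUSE_PCID) == 0;
}

/* Bitmask of CPUs (up to 64) that currently have pm loaded. */
static uint64_t
sketch_shootdown_targets(const struct sketch_cpu *cpus, size_t ncpus,
    const struct sketch_pmap *pm)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < ncpus && i < 64; i++)
		if (cpus[i].ci_proc_pmap == pm)
			mask |= 1ULL << i;
	return mask;
}
```

Because the per-CPU pointer has a single writer, this lookup needs no atomic
updates to a bitset shared through the pmap, which is the saving the log
message describes.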


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
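
The two rules of the semi-eager scheme above can be modelled in a few lines of C. This is a sketch under stated assumptions, not the kernel code: struct fake_thread, struct fake_cpu and both function names are hypothetical, and a single int stands in for the whole xstate area.

```c
#include <assert.h>
#include <stddef.h>

#define BLANK_STATE	0	/* stand-in for the blank xstate */

/* Hypothetical thread: saved_state plays the role of the PCB copy. */
struct fake_thread {
	int saved_state;
};

/* Hypothetical CPU: the flag tracks whether curproc's xstate is loaded. */
struct fake_cpu {
	int xstate_loaded;
	int live_state;			/* what the FPU registers hold */
	struct fake_thread *curproc;
};

/*
 * Context switch, sendsig(), vmm, kernel crypto: if the flag is set,
 * save the state to the PCB, clear the flag, load the blank state.
 */
static void
fpu_kernel_enter(struct fake_cpu *cpu)
{
	assert(cpu != NULL && cpu->curproc != NULL);

	if (cpu->xstate_loaded) {
		cpu->curproc->saved_state = cpu->live_state;
		cpu->xstate_loaded = 0;
		cpu->live_state = BLANK_STATE;
	}
}

/* Return to userspace: if the flag is clear, set it and restore. */
static void
fpu_return_to_user(struct fake_cpu *cpu)
{
	assert(cpu != NULL && cpu->curproc != NULL);

	if (!cpu->xstate_loaded) {
		cpu->xstate_loaded = 1;
		cpu->live_state = cpu->curproc->saved_state;
	}
}
```

Because both paths test the flag first, calling fpu_kernel_enter() twice in a row never overwrites the saved state with the blank one.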


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
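
A minimal sketch of the cookie part of this mitigation, in C. The struct, the function names and the proc_secret parameter are hypothetical. Only the (per-process value ^ &sigcontext) construction and the clear-on-use step come from the commit message, and the check of the syscall's PC against the sigtramp is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sigcontext: only the cookie field is modelled. */
struct fake_sigcontext {
	uint64_t sc_cookie;
};

/* sendsig() side: bind the cookie to this process and this address. */
static void
sendsig_store_cookie(struct fake_sigcontext *sc, uint64_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: returns 1 if the cookie matches, and clears it so
 * that the same sigcontext cannot be replayed; returns 0 otherwise.
 */
static int
sigreturn_check_cookie(struct fake_sigcontext *sc, uint64_t proc_secret)
{
	assert(sc != NULL);

	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return 0;
	sc->sc_cookie = 0;
	return 1;
}
```

Mixing the address into the cookie means a sigcontext copied to another location fails the check, and clearing it means a second sigreturn on the same frame fails as well.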


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
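
The padding arithmetic behind this entry, as a small C sketch. The macro and function names are hypothetical; the only assumption is the 2MB large-page size named in the message.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round an address up to the next 2MB boundary (no-op if aligned). */
static uint64_t
round_up_2mb(uint64_t addr)
{
	/* The mask trick below only works for a power-of-two size. */
	assert((LARGE_PAGE_SIZE & (LARGE_PAGE_SIZE - 1)) == 0);
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the VA hole left between the end of bss and the padded end. */
static uint64_t
hole_after_bss(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

Ending the kernel area on a 2MB boundary means no large direct-map page straddles the boundary, so the pages covering the kernel can lose W as a whole.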


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
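
A sketch of the return-to-userspace decision described in this entry, in C. pcb_fsbase and ci_cur_fsbase are the names from the commit message; the struct, the enum and restore_fsbase() are hypothetical, and the function only reports which path would be taken instead of touching any register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which action the return path would take for FS.base. */
enum fs_action {
	FS_SKIP,		/* CPU already has the right value */
	FS_LOAD_SELECTOR,	/* zero base: reloading %fs is enough */
	FS_WRMSR		/* non-zero base: write the MSR */
};

/* Hypothetical cut-down cpu_info with only the tracking field. */
struct fake_cpu_info {
	uint64_t ci_cur_fsbase;	/* FS.base currently loaded on this CPU */
};

static enum fs_action
restore_fsbase(struct fake_cpu_info *ci, uint64_t pcb_fsbase)
{
	assert(ci != NULL);

	if (ci->ci_cur_fsbase == pcb_fsbase)
		return FS_SKIP;
	ci->ci_cur_fsbase = pcb_fsbase;
	/* Setting %fs zeroes FS.base in fewer cycles than wrmsr. */
	return pcb_fsbase == 0 ? FS_LOAD_SELECTOR : FS_WRMSR;
}
```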


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
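
Why "size - 1": the limit field of a descriptor-table register holds the offset of the last valid byte, not the byte count. A small C sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a table size in bytes to the 16-bit limit value.  A table
 * of N bytes has valid offsets 0 .. N-1, so the limit is N - 1.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	/* The limit is 16 bits wide, so at most 64KB can be described. */
	assert(table_bytes > 0 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

With three 8-byte descriptors the table is 24 bytes long and the limit is 23. Using 24 would declare one byte past the end of the table as valid.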


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.143 12-Dec-2023 deraadt

The sigtramp was calling sigreturn(2), and upon failure exit(2), which
doesn't make sense anymore. It is better to just issue an illegal
instruction.
ok kettenis, with some misgivings about inconsistent approaches between
architectures.
In the future we could change sigreturn(2) to never return an exit code,
but always just terminate the process. We stopped this system call
from being callable ages ago with msyscall(2), and there is no stub for
it in libc.. maybe that's the next step to take?


# 1.142 10-Dec-2023 deraadt

Add a new label "sigcodecall" inside every sigtramp definition, directly
in front of the syscall instruction. This is used to calculate the start
of the syscall for SYS_sigreturn and pinned system calls.
ok kettenis


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code, change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from being at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 November 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when Meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is joint work with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
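
The skip-if-unchanged logic described above can be sketched in user-space C. This is only an illustration under stated assumptions: pcb_fsbase and ci_cur_fsbase are the field names from the commit message, while restore_fsbase() and the two fake_* helpers are hypothetical stand-ins that count which path is taken instead of touching the CPU.

```c
#include <stdint.h>

/* Stand-ins for the two fields named in the commit message. */
struct pcb {
	uint64_t pcb_fsbase;		/* value this thread wants */
};

struct cpu_info {
	uint64_t ci_cur_fsbase;		/* value the CPU currently holds */
	unsigned wrmsr_count;		/* hypothetical: counts slow-path writes */
	unsigned segload_count;		/* hypothetical: counts %fs reloads */
};

/* Hypothetical stub for the wrmsr that sets FS.base. */
static void
fake_wrmsr_fsbase(struct cpu_info *ci, uint64_t value)
{
	ci->wrmsr_count++;
	ci->ci_cur_fsbase = value;
}

/* Hypothetical stub for loading %fs, which zeros FS.base as a side effect. */
static void
fake_load_fs_selector(struct cpu_info *ci)
{
	ci->segload_count++;
	ci->ci_cur_fsbase = 0;
}

/* On return to user-space: do nothing if the CPU already has the value. */
static void
restore_fsbase(struct cpu_info *ci, const struct pcb *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;
	if (pcb->pcb_fsbase == 0)
		fake_load_fs_selector(ci);	/* cheaper than wrmsr */
	else
		fake_wrmsr_fsbase(ci, pcb->pcb_fsbase);
}
```

Calling restore_fsbase() twice in a row for the same thread performs at most one write, which is the saving the commit is after.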


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
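
The rule behind this fix is that the limit field loaded by lgdt holds the offset of the last valid byte of the table, not its size. A minimal user-space sketch follows; struct region_desc, make_region() and the field names are illustrative assumptions, not the kernel's own definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 10-byte pseudo-descriptor in the layout lgdt reads in long mode. */
struct region_desc {
	uint16_t rd_limit;	/* offset of the last valid byte */
	uint64_t rd_base;	/* linear address of the table */
} __attribute__((packed));

/* Build a pseudo-descriptor for a table of `size` bytes starting at `base`. */
static struct region_desc
make_region(const void *base, size_t size)
{
	struct region_desc rd;

	rd.rd_limit = (uint16_t)(size - 1);	/* size - 1, not size */
	rd.rd_base = (uint64_t)(uintptr_t)base;
	return rd;
}
```

For a table of eight 8-byte descriptors this yields a limit of 63, so a selector pointing one entry past the end is rejected by the CPU.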


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc., just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.144 12-Dec-2023 deraadt

remove support for syscall(2) -- the "indirection system call" because
it is a dangerous alternative entry point for all system calls, and thus
incompatible with the precision system call entry point scheme we are
heading towards. This has been a 3-year mission:
First perl needed a code-generated wrapper to fake syscall(2) as a giant
switch table, then all the ports were cleaned with relatively minor fixes,
except for "go". "go" required two fixes -- 1) a framework issue with
old library versions, and 2) like perl, a fake syscall(2) wrapper to
handle ioctl(2) and sysctl(2) because "syscall(SYS_ioctl" occurs all over
the place in the "go" ecosystem because the "go developers" are plan9-loving
unix-hating folk who tried to build an ecosystem without allowing "ioctl".
ok kettenis, jsing, afresh1, sthen


# 1.143 12-Dec-2023 deraadt

The sigtramp was calling sigreturn(2), and upon failure exit(2), which
doesn't make sense anymore. It is better to just issue an illegal
instruction.
ok kettenis, with some misgivings about inconsistent approaches between
architectures.
In the future we could change sigreturn(2) to never return an exit code,
but always just terminate the process. We stopped this system call
from being callable ages ago with msyscall(2), and there is no stub for
it in libc.. maybe that's the next step to take?


# 1.142 10-Dec-2023 deraadt

Add a new label "sigcodecall" inside every sigtramp definition, directly
in front of the syscall instruction. This is used to calculate the start
of the syscall for SYS_sigreturn and pinned system calls.
ok kettenis


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code, change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc, factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@
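
The install-a-handler-then-bail-out pattern can be imitated in user-space C with setjmp/longjmp. This is a loose analogy under stated assumptions, not the kernel code: the onfault pointer stands in for pcb_onfault, fake_page_fault() stands in for the trap handler, and every function name here is hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>

/* Stand-in for pcb_onfault: where to resume if a fault occurs. */
static jmp_buf *onfault;

/* Stand-in for the trap handler: bail out only if a handler is installed. */
static void
fake_page_fault(void)
{
	if (onfault != NULL)
		longjmp(*onfault, 1);
	/* With no handler installed a real kernel would panic here. */
}

/* Hypothetical well-behaved and buggy "runtime services". */
static void
good_service(void)
{
}

static void
buggy_service(void)
{
	fake_page_fault();	/* touches memory it never asked us to map */
}

/* Run a service with the fault handler installed; -1 means it faulted. */
static int
call_guarded(void (*service)(void))
{
	jmp_buf env;

	if (setjmp(env) != 0) {
		onfault = NULL;		/* fault path: handler consumed */
		return -1;
	}
	onfault = &env;
	service();
	onfault = NULL;
	return 0;
}
```

The caller gets an error return instead of a crash, and the handler is always cleared before call_guarded() returns so that later faults are not silently swallowed.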


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
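
The corrected "am I loading the same cr3?" test amounts to comparing the two values with the reuse bit masked off. A minimal C sketch, assuming CR3_REUSE_PCID is bit 63 (the architectural no-flush bit of a %cr3 write when PCIDs are enabled); same_cr3() is a hypothetical helper name, and the real test lives in the assembly of cpu_switchto().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: bit 63 asks the CPU to keep the TLB entries of the PCID. */
#define CR3_REUSE_PCID	(1ULL << 63)

/* True if both values name the same page table and PCID, ignoring bit 63. */
static bool
same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~CR3_REUSE_PCID) == 0;
}
```

Without the mask, two values that differ only in the reuse bit would compare as different and trigger a needless %cr3 reload.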


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
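
The flag tracking described above can be sketched in portable C. This is
only an illustration under stated assumptions: every type and function name
below is hypothetical, the "live" array stands in for the real FPU registers,
and plain structure copies stand in for the xsave/xrstor instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the extended register state. */
struct xstate_sketch {
	unsigned char regs[64];
};

/* Hypothetical per-thread save area (the PCB in the real kernel). */
struct thread_sketch {
	struct xstate_sketch saved;
};

/* Hypothetical per-CPU view: "live" models the hardware registers. */
struct cpu_sketch {
	struct xstate_sketch live;
	bool user_xstate_loaded;	/* the tracking flag */
};

const struct xstate_sketch blank_xstate = { { 0 } };

/* Context switch, sendsig(), vmm, in-kernel CPU crypto. */
void
fpu_kernel_enter(struct cpu_sketch *ci, struct thread_sketch *old)
{
	assert(ci != NULL && old != NULL);
	if (ci->user_xstate_loaded) {
		old->saved = ci->live;		/* models xsave */
		ci->user_xstate_loaded = false;
		ci->live = blank_xstate;	/* models loading blank state */
	}
}

/* Return to userspace: restore only if the state is not already loaded. */
void
fpu_return_to_user(struct cpu_sketch *ci, const struct thread_sketch *cur)
{
	assert(ci != NULL && cur != NULL);
	if (!ci->user_xstate_loaded) {
		ci->user_xstate_loaded = true;
		ci->live = cur->saved;		/* models xrstor */
	}
}
```

Calling fpu_kernel_enter() twice in a row is harmless in this model: the
second call sees the flag clear and does nothing, so no IPI or trap is
needed to keep the saved copy consistent.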


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
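
The cookie scheme described above can be sketched as follows. This is a
minimal illustration, not the kernel's code: the structure and function
names are hypothetical, and the check of the syscall entry PC against the
sigtramp address is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* filled in at signal delivery */
};

struct process_sketch {
	uintptr_t ps_sigcookie;	/* random per-process secret */
};

/* Signal delivery: bind the cookie to this exact sigcontext address. */
void
sendsig_store_cookie(const struct process_sketch *pr,
    struct sigcontext_sketch *scp)
{
	assert(pr != NULL && scp != NULL);
	scp->sc_cookie = pr->ps_sigcookie ^ (uintptr_t)scp;
}

/*
 * sigreturn: accept the context only if the cookie matches, then clear
 * it so the same sigcontext cannot be reused.  Returns 0 on success and
 * -1 if the context must be rejected.
 */
int
sigreturn_check_cookie(const struct process_sketch *pr,
    struct sigcontext_sketch *scp)
{
	assert(pr != NULL && scp != NULL);
	if (scp->sc_cookie != (pr->ps_sigcookie ^ (uintptr_t)scp))
		return -1;
	scp->sc_cookie = 0;
	return 0;
}
```

Because the address of the sigcontext is mixed into the cookie, a context
copied to a different address fails the check even when the attacker can
read the original cookie value.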


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
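
The padding arithmetic implied above can be sketched in C. The helper names
and the constant name are hypothetical; the only facts taken from the log
are the 2MB boundary and the resulting hole after the end of bss.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 2MB large-page size used for the padding. */
#define LARGE_PAGE_SIZE_SKETCH	(UINT64_C(2) * 1024 * 1024)

/* Round an address up to the next 2MB boundary. */
uint64_t
round_up_2mb(uint64_t addr)
{
	/* The mask trick below is only valid for power-of-two sizes. */
	assert((LARGE_PAGE_SIZE_SKETCH & (LARGE_PAGE_SIZE_SKETCH - 1)) == 0);
	return (addr + LARGE_PAGE_SIZE_SKETCH - 1) &
	    ~(LARGE_PAGE_SIZE_SKETCH - 1);
}

/* Size of the VA hole between the end of bss and the padded end. */
uint64_t
padding_hole_size(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

An address that already sits on a 2MB boundary is returned unchanged, so
the hole can be anywhere from zero bytes to just under 2MB.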


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
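
The skip-if-unchanged logic described above can be sketched in C. The
member names pcb_fsbase and ci_cur_fsbase come from the commit message;
the structure names, the function name, and the two counters are
hypothetical and exist only to make the behaviour observable. The real
code runs in assembly on the return-to-userspace path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pcb_sketch {
	uint64_t pcb_fsbase;		/* value the thread wants in FS.base */
};

struct cpu_info_sketch {
	uint64_t ci_cur_fsbase;		/* value the CPU currently holds */
	unsigned int wrmsr_count;	/* sketch only: wrmsr operations */
	unsigned int fs_reload_count;	/* sketch only: %fs reloads */
};

/* Called on return to userspace. */
void
restore_fsbase(struct cpu_info_sketch *ci, const struct pcb_sketch *pcb)
{
	assert(ci != NULL && pcb != NULL);
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		ci->fs_reload_count++;	/* setting %fs zeros FS.base */
	else
		ci->wrmsr_count++;	/* models the slower wrmsr */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

Two threads with the same FS.base value running back to back cost nothing
extra in this model, and non-threaded processes without TLS never reach the
wrmsr branch.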


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok
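
The "pointer to the currently loaded pmap" idea above can be sketched in C.
All names are hypothetical and a counter stands in for the %cr3 reload. The
sketch assumes that kernel mappings are present in every pmap, which is why
kernel and idle threads can keep running on whatever pmap is loaded.

```c
#include <assert.h>
#include <stddef.h>

struct pmap_sketch {
	int id;
};

struct cpu_info_sketch {
	struct pmap_sketch *curpmap;	/* pmap whose tables are loaded */
	unsigned int cr3_loads;		/* sketch only: page table loads */
};

struct pmap_sketch kernel_pmap_sketch = { 0 };

/* Context switch: only reload when a different user pmap is needed. */
void
switch_pmap(struct cpu_info_sketch *ci, struct pmap_sketch *next)
{
	assert(ci != NULL && next != NULL);
	if (next == &kernel_pmap_sketch)
		return;			/* keep the previous pmap loaded */
	if (ci->curpmap == next)
		return;			/* switched back to same process */
	ci->curpmap = next;
	ci->cr3_loads++;
}

/* Models the IPI that forces a reload before a pmap is destroyed. */
void
pmap_reload_before_destroy(struct cpu_info_sketch *ci,
    struct pmap_sketch *dying)
{
	assert(ci != NULL && dying != NULL);
	if (ci->curpmap == dying) {
		ci->curpmap = &kernel_pmap_sketch;
		ci->cr3_loads++;
	}
}
```

The second function is the part that matters for correctness: without it a
CPU could keep running on the page tables of a pmap that has been freed.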


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc., just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.142 10-Dec-2023 deraadt

Add a new label "sigcodecall" inside every sigtramp definition, directly
in front of the syscall instruction. This is used to calculate the start
of the syscall for SYS_sigreturn and pinned system calls.
ok kettenis


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@
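
The decision rule in the bullet list above can be sketched in C. The
structure and function names are hypothetical, and the feature flags would
come from CPUID and MSR probing in the real kernel, which is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the probed CPU features. */
struct cpu_features_sketch {
	bool has_eibrs;
	bool has_ibt;
};

/* Hypothetical result of the boot-time decision. */
struct branch_mitigation_sketch {
	bool enable_eibrs;	/* turn eIBRS on */
	bool keep_retpolines;	/* false: codepatch to plain indirect jumps */
};

struct branch_mitigation_sketch
choose_branch_mitigation(const struct cpu_features_sketch *f)
{
	struct branch_mitigation_sketch m;

	assert(f != NULL);
	m.enable_eibrs = f->has_eibrs;
	/* Retpolines stay only when neither eIBRS nor IBT is available. */
	m.keep_retpolines = !(f->has_eibrs || f->has_ibt);
	return m;
}
```

The default remains the retpoline, so a CPU with neither feature behaves as
it did before this change.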


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This keeps the ret instruction
from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
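
The corrected "am I loading the same cr3?" test can be sketched in C as
follows. This is an illustration only, not the committed assembly: the
helper name same_cr3() is hypothetical, and the value of CR3_REUSE_PCID
(bit 63, the architectural "do not flush this PCID" hint) is assumed
here rather than quoted from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: bit 63 of a %cr3 write is only a "keep the TLB" hint. */
#define CR3_REUSE_PCID	(1ULL << 63)

/*
 * Hypothetical helper: two %cr3 values name the same page tables and
 * PCID if they differ at most in the reuse hint bit.
 */
static bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID) == 0;
}
```

With a comparison of this shape, a switch between two kernel threads
whose saved values differ only in the hint bit is treated as "no change"
and the reload can be skipped.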


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown is
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
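
The flag tracking described above can be modelled in a few lines of C.
This is a minimal sketch under stated assumptions, not the kernel code:
struct fpu_cpu_sketch and both function names are hypothetical, and the
counters merely stand in for the real save/restore instructions.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state: is curproc's xstate loaded in the CPU? */
struct fpu_cpu_sketch {
	bool	xstate_loaded;
	int	saves;		/* stand-in for saving state to the PCB */
	int	restores;	/* stand-in for restoring the thread's state */
};

/* Context switch, sendsig(), vmm, and in-kernel CPU crypto take this path. */
static void
fpu_kernel_claim(struct fpu_cpu_sketch *ci)
{
	if (ci->xstate_loaded) {
		ci->saves++;			/* save old thread's state */
		ci->xstate_loaded = false;	/* clear the flag */
		/* the real code then loads the blank state */
	}
}

/* Return to userspace: restore only if the state is not already loaded. */
static void
fpu_return_to_user(struct fpu_cpu_sketch *ci)
{
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = true;
		ci->restores++;
	}
}
```

Repeated returns to userspace without an intervening kernel claim cost
no restore, which is the "semi-eager" part of the design.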


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
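
The cookie scheme can be illustrated with a small C model. It is a
sketch under assumptions, not the sendsig()/sigreturn(2) code: the
struct, the field name sc_cookie as used here, and both helper names
are hypothetical, and the check that the syscall came from the exact
sigtramp PC is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of sigcontext holding the cookie. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;
};

/* sendsig() side: bind the per-process secret to this sigcontext address. */
static void
cookie_store(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* sigreturn() side: verify the cookie, then clear it to prevent reuse. */
static bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address is mixed into the cookie, a sigcontext copied to a
different location fails the check, and a cleared one cannot be replayed
(barring the negligible case where the secret equals the address).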


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
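
The size of the resulting VA hole follows from rounding the end of bss
up to the next 2MB boundary. The arithmetic can be sketched as below;
the helper names and the LARGE_PAGE_SIZE macro are hypothetical and the
addresses in any usage are made up for illustration.

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB large page */

/* Round an address up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Bytes of padding between the end of bss and the padded kernel end. */
static uint64_t
va_hole_size(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```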


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
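
The return-to-userspace decision can be modelled in C roughly as
follows. This is a sketch, not the committed assembly: struct
fs_cpu_sketch, the counters, and the function name are hypothetical;
only ci_cur_fsbase and pcb_fsbase are names taken from the message
above.

```c
#include <stdint.h>

/* Hypothetical per-CPU state; counters stand in for the real instructions. */
struct fs_cpu_sketch {
	uint64_t ci_cur_fsbase;	/* FS.base currently loaded on this CPU */
	int	 msr_writes;	/* stand-in for wrmsr of FS.base */
	int	 seg_loads;	/* stand-in for reloading %fs */
};

static void
fsbase_on_user_return(struct fs_cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->ci_cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->seg_loads++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->msr_writes++;	/* general case: wrmsr */
	ci->ci_cur_fsbase = pcb_fsbase;
}
```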


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of the cpus on which the pmap might be active, which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.
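
As a rough illustration of the idea above (not the kernel's actual code), the
per-CPU pointer can be modelled as follows. The names ci_curpmap and
switch_pmap() and the reload counter are assumptions made for this sketch; it
relies on the kernel mappings being reachable from any loaded pmap.

```c
#include <stddef.h>

struct pmap { int pm_id; };		/* toy address space */

struct cpu_info {
	struct pmap *ci_curpmap;	/* pmap currently loaded on this CPU */
	unsigned int ci_reloads;	/* counts simulated %cr3 loads */
};

static struct pmap kernel_pmap;		/* used by idle and kernel threads */

/*
 * Switch to the pmap of the next thread.  Kernel threads run fine on
 * whatever pmap is loaded, so nothing is reloaded for them, and a later
 * switch back to the same process finds its pmap still loaded.
 */
static void
switch_pmap(struct cpu_info *ci, struct pmap *next)
{
	if (next == &kernel_pmap || next == ci->ci_curpmap)
		return;			/* keep the loaded pmap */
	ci->ci_curpmap = next;		/* stands in for loading %cr3 */
	ci->ci_reloads++;
}
```

Switching from a process to a kernel thread and back to the same process
costs no reload in this model; only a switch to a different process does.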

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.
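
For illustration: the limit field of a descriptor-table register holds the
offset of the last valid byte, so it is the table size minus one. A minimal
sketch, where the structure layout, the gdt_entry_t type and set_region()
are assumptions for the example rather than the kernel's definitions:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gdt_entry_t;	/* one 8-byte descriptor slot (illustrative) */

/* Operand format of lgdt in long mode: 16-bit limit, 64-bit base. */
struct region_descriptor {
	uint16_t rd_limit;	/* offset of the last valid byte */
	uint64_t rd_base;	/* linear address of the table */
} __attribute__((packed));

/* Describe a table of nentries slots; the limit is size - 1, not size. */
static void
set_region(struct region_descriptor *rd, const gdt_entry_t *table,
    size_t nentries)
{
	rd->rd_limit = (uint16_t)(nentries * sizeof(gdt_entry_t) - 1);
	rd->rd_base = (uint64_t)(uintptr_t)table;
}
```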

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)
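
One way to see why a TAILQ helps here: removal of an element is O(1), so a
wakeup can make a single pass over the queue. The sketch below is a userland
toy using <sys/queue.h>; struct sleeper and wakeup_all() are hypothetical
names, not the MI scheduler code.

```c
#include <stddef.h>
#include <sys/queue.h>

struct sleeper {
	const void *wchan;		/* channel the thread sleeps on */
	int awake;
	TAILQ_ENTRY(sleeper) link;
};
TAILQ_HEAD(sleepq, sleeper);

/*
 * Wake every sleeper waiting on wchan in one pass.  Each TAILQ_REMOVE
 * is constant time, so the whole walk is linear in the queue length.
 */
static int
wakeup_all(struct sleepq *q, const void *wchan)
{
	struct sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking */
		if (s->wchan != wchan)
			continue;
		TAILQ_REMOVE(q, s, link);
		s->awake = 1;
		n++;
	}
	return n;
}
```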

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.141 24-Oct-2023 claudio

Normally context switches happen in mi_switch() but there are 3 cases
where a switch happens outside. Clean up these code paths and make them
machine independent.

- when a process forks (fork, tfork, kthread), the new proc needs to
somehow be scheduled for the first time. This is done by proc_trampoline.
Since proc_trampoline is machine dependent assembler code, change
the MP specific proc_trampoline_mp() to proc_trampoline_mi() and make
sure it is now always called.
- cpu_hatch: when booting APs the code needs to jump to the first proc
running on that CPU. This should be the idle thread for that CPU.
- sched_exit: when a proc exits it needs to switch away from itself and
then instruct the reaper to clean up the rest. This is done by switching
to the idle loop.

Since the last two cases require a context switch to the idle proc factor
out the common code to sched_toidle() and use it in those places.

Tested by many on all archs.
OK miod@ mpi@ cheloha@


Revision tags: OPENBSD_7_4_BASE
# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with a endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.
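
The recovery pattern described above can be sketched in userland with
setjmp/longjmp. This is only an analogy under stated assumptions: onfault
stands in for pcb_onfault, fake_fault() for the page-fault handler, and the
two service functions for EFI runtime calls; none of these names come from
the kernel.

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf *onfault;	/* stand-in for pcb_onfault */

/* Stand-in for the fault handler: bail out if a recovery point is set. */
static void
fake_fault(void)
{
	if (onfault != NULL)
		longjmp(*onfault, 1);
}

static int buggy_service(void) { fake_fault(); return 0; }
static int good_service(void) { return 0; }

/* Call a service with a recovery point installed; -1 means it faulted. */
static int
call_guarded(int (*service)(void))
{
	jmp_buf env;
	int error;

	if (setjmp(env) != 0) {
		onfault = NULL;		/* arrived here via longjmp */
		return -1;
	}
	onfault = &env;
	error = service();
	onfault = NULL;
	return error;
}
```

The caller sees an error return instead of a crash, which is the effect the
fault handler is meant to achieve for misbehaving firmware.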

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from being at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.
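
A minimal sketch of such a comparison, assuming CR3_REUSE_PCID is bit 63 of
the value written to %cr3 (the "do not flush" hint). The helper name
cr3_same_tables() is hypothetical; the real test lives in the cpu_switchto()
assembly.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR3_REUSE_PCID	(1ULL << 63)	/* assumed value of the hint bit */

/*
 * True if both values name the same page tables and PCID, i.e. they
 * differ at most in the reuse hint, so reloading %cr3 can be skipped.
 */
static bool
cr3_same_tables(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~CR3_REUSE_PCID) == 0;
}
```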

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to use a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state
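
The two rules above amount to a small state machine around one per-CPU flag.
A toy model, with all structure and function names invented for the sketch
(the real code uses xsave/xrstor and codepatching):

```c
#include <stdbool.h>

struct xstate { unsigned char regs[16]; };	/* toy register file */
struct thread { struct xstate saved; };		/* PCB save area */

struct cpu {
	struct xstate live;		/* what the hardware holds now */
	bool user_xstate_loaded;	/* is curproc's xstate in the CPU? */
};

static const struct xstate blank_state;	/* all-zero "blank" state */

/* Context switch, sendsig(), vmm, kernel crypto: save and blank if loaded. */
static void
fpu_kernel_enter(struct cpu *ci, struct thread *cur)
{
	if (ci->user_xstate_loaded) {
		cur->saved = ci->live;
		ci->user_xstate_loaded = false;
		ci->live = blank_state;
	}
}

/* Return to userspace: restore the thread's state if it is not loaded. */
static void
fpu_return_to_user(struct cpu *ci, const struct thread *cur)
{
	if (!ci->user_xstate_loaded) {
		ci->live = cur->saved;
		ci->user_xstate_loaded = true;
	}
}
```

Because the state is always restored before userspace runs, no trap is
needed to load it on first use, which is the point of the change.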

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
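
The cookie scheme described in 1.78 can be illustrated with a minimal userland sketch in C. This is not the kernel code: the struct, the function names, and the 64-bit secret are hypothetical stand-ins, and the separate check on the syscall PC is omitted. It only shows why a frame copied to another address, or replayed after use, fails verification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sigcontext; only the cookie field matters here. */
struct sketch_sigcontext {
	uint64_t sc_cookie;
};

/* sendsig() side: tie the frame to its own address using a per-process secret. */
static void
sketch_set_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: return 1 if the cookie matches this address and
 * secret, clearing it so the same frame cannot be accepted twice;
 * return 0 otherwise.
 */
static int
sketch_check_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return 0;
	sc->sc_cookie = 0;
	return 1;
}
```

Because the address of the sigcontext is mixed into the cookie, an attacker who forges or relocates a frame would need both the per-process secret and the exact frame address to pass the check.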


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
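
The return-to-userspace decision described in 1.45 can be sketched in C as follows. This is a userland model under stated assumptions: the struct and its counters are hypothetical, and the real code performs the %fs reload and wrmsr in assembly. The counters only make visible which path would be taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-CPU state; the counters exist only for the sketch. */
struct sketch_cpu {
	uint64_t cur_fsbase;	/* models ci_cur_fsbase */
	unsigned wrmsr_count;	/* times the expensive MSR write would run */
	unsigned fsload_count;	/* times the cheaper %fs reload would run */
};

/* On return to userspace: make the CPU's FS.base match the thread's pcb_fsbase. */
static void
sketch_restore_fsbase(struct sketch_cpu *ci, uint64_t pcb_fsbase)
{
	assert(ci != NULL);
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->fsload_count++;	/* setting %fs zeros FS.base */
	else
		ci->wrmsr_count++;	/* nonzero base needs wrmsr */
	ci->cur_fsbase = pcb_fsbase;
}
```

Switching repeatedly between threads that share the same base, or that all leave FS.base at zero, never reaches the wrmsr path in this model.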


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
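
As a small illustration of the fix in 1.31: the x86 descriptor-table limit is the offset of the last valid byte, not the byte count, so it is one less than the table size. The helper below is a hypothetical sketch in C, not code from locore.S.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit for a descriptor table of table_bytes bytes.
 * The limit names the last addressable byte, hence size - 1.
 */
static uint16_t
sketch_gdt_limit(size_t table_bytes)
{
	assert(table_bytes > 0 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

For example, a table of five 8-byte descriptors occupies 40 bytes and gets a limit of 39; using 40 would tell the CPU that one byte past the table is also valid.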


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.140 31-Jul-2023 guenther

On CPUs with eIBRS ("enhanced Indirect Branch Restricted Speculation")
or IBT enabled in the kernel, the hardware should block the attacks which
retpolines were created to prevent. In those cases, retpolines
should be a net negative for security as they are an indirect branch
gadget. They're also slower.
* use -mretpoline-external-thunk to give us control of the code
used for indirect branches
* default to using a retpoline as before, but mark it and the
other ASM kernel retpolines for code patching
* if the CPU has eIBRS, then enable it
* if the CPU has eIBRS *or* IBT, then codepatch the three different
retpolines to just indirect jumps

make clean && make config required after this

ok kettenis@


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents that the ret
instruction is at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
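
The "am I loading the same cr3?" fix in 1.125 amounts to comparing the two values with the reuse bit masked off. A minimal sketch in C, assuming the bit is bit 63 (the architectural "do not invalidate" bit of a CR3 load when PCIDs are enabled); the macro and function names are hypothetical, and the real test lives in the cpu_switchto() assembly:

```c
#include <stdint.h>

/* Assumed to match the kernel's CR3_REUSE_PCID: bit 63 of the value loaded into %cr3. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/*
 * Return nonzero if cur and next name the same page table base and PCID,
 * ignoring whether either value carries the reuse hint.
 */
static int
sketch_same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

Without the mask, two values that differ only in the hint bit would compare unequal and trigger a needless %cr3 reload.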


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise the
proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
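
The PCID split described in revision 1.110 can be sketched in C as follows.
This is a minimal illustration, not the kernel's code: the PCID_*_SKETCH
names and values and the make_cr3() helper are assumptions made for this
example. The only architectural facts relied on are that, with CR4.PCIDE
set, the low 12 bits of %cr3 carry the PCID and bit 63 of the value written
asks the CPU to keep that PCID's cached translations.

```c
#include <stdint.h>

/* Hypothetical PCID assignments, one per use named in the commit message. */
enum pcid_sketch {
	PCID_KERN_SKETCH = 0,		/* kernel threads */
	PCID_PROC_SKETCH = 1,		/* U+K tables of a normal process */
	PCID_PROC_UONLY_SKETCH = 2,	/* matching U-K tables (Meltdown case) */
	PCID_TEMP_SKETCH = 3		/* temporary mappings of other processes */
};

#define CR3_PCID_MASK	0xfffULL	/* low 12 bits hold the PCID */
#define CR3_NOFLUSH	(1ULL << 63)	/* keep cached translations for the PCID */

/* Combine a page-table root address and a PCID into a %cr3 value. */
static uint64_t
make_cr3(uint64_t pml4_pa, enum pcid_sketch pcid, int keep_tlb)
{
	uint64_t cr3;

	cr3 = (pml4_pa & ~CR3_PCID_MASK) | ((uint64_t)pcid & CR3_PCID_MASK);
	if (keep_tlb)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}
```

Keeping the PCID in the same value as the table root means a single %cr3
write selects both the address space and its tagged TLB entries.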


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
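
The flag tracking described in revision 1.98 can be modelled in a few lines
of C. This is a sketch under stated assumptions, not the kernel's
implementation: the struct and function names are hypothetical, and an int
stands in for the real xstate save area.

```c
#include <stdbool.h>

#define XSTATE_BLANK_SKETCH	0	/* stand-in for the blank state */

/* Hypothetical per-thread save area (the PCB in the commit message). */
struct fpu_thread_sketch {
	int saved_xstate;
};

/* Hypothetical per-CPU view: is curproc's xstate loaded, and what is live. */
struct fpu_cpu_sketch {
	bool xstate_loaded;
	int live_xstate;
};

/* Context switch, sendsig(), vmm, kernel crypto: save, clear flag, blank. */
static void
fpu_kernel_enter_sketch(struct fpu_cpu_sketch *ci, struct fpu_thread_sketch *td)
{
	if (ci->xstate_loaded) {
		td->saved_xstate = ci->live_xstate;	/* save to the PCB */
		ci->xstate_loaded = false;
		ci->live_xstate = XSTATE_BLANK_SKETCH;	/* load blank state */
	}
}

/* Return to userspace: if the flag is clear, set it and restore. */
static void
fpu_return_to_user_sketch(struct fpu_cpu_sketch *ci,
    const struct fpu_thread_sketch *td)
{
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = true;
		ci->live_xstate = td->saved_xstate;
	}
}
```

Because both paths test the same flag, repeated kernel entries or repeated
returns are no-ops, which is what removes the need for the #DNA trap.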


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
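
The cookie scheme from revision 1.78 can be illustrated with a small C
sketch. It assumes a simplified, hypothetical sigcontext with only a cookie
field and a caller-supplied per-process secret; the real sendsig() and
sigreturn(2) code also checks the entry PC, which is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the real sigcontext. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
	/* saved registers omitted */
};

/* sendsig() side: store (per-process secret ^ &sigcontext). */
static void
cookie_store_sketch(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: verify the cookie against this exact address, then
 * clear it so the same sigcontext cannot be presented a second time.
 */
static bool
cookie_check_and_clear_sketch(struct sigcontext_sketch *sc,
    uint64_t proc_secret)
{
	bool ok;

	ok = sc->sc_cookie == (proc_secret ^ (uint64_t)(uintptr_t)sc);
	sc->sc_cookie = 0;
	return ok;
}
```

Mixing the address into the cookie means a sigcontext copied to another
location fails the check even if the attacker never learns the secret.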


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
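
As a rough illustration of the test this revision regularizes: the low two
bits of an x86 segment selector hold the requested privilege level, so a
single mask is enough to tell ring 3 from ring 0. The macro and function
names below carry a _SKETCH suffix or are otherwise hypothetical; they are
not taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_RPL_SKETCH	3	/* mask for the low two selector bits */
#define SEL_UPL_SKETCH	3	/* user privilege level (ring 3) */

/* True when the selector's privilege level says "came from userspace". */
static bool
selector_is_user_sketch(uint16_t sel)
{
	return (sel & SEL_RPL_SKETCH) == SEL_UPL_SKETCH;
}
```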


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg
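
The idea behind tag-driven code patching can be sketched in C on a plain
byte buffer. Everything here is an assumption for illustration: the
patch_site_sketch layout, the function name, and the use of the single-byte
NOP (0x90) only. It does not show how live kernel text is made writable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP_BYTE_SKETCH	0x90	/* single-byte x86 NOP */

/* Hypothetical patch-table entry: a tagged byte range inside a code image. */
struct patch_site_sketch {
	size_t   offset;	/* start of the range within the image */
	size_t   len;		/* number of bytes to overwrite */
	uint16_t tag;		/* which feature this site belongs to */
};

/* Overwrite every in-range site carrying `tag` with NOPs; return the count. */
static size_t
patch_nop_sketch(uint8_t *image, size_t image_len,
    const struct patch_site_sketch *sites, size_t nsites, uint16_t tag)
{
	size_t i, patched = 0;

	for (i = 0; i < nsites; i++) {
		if (sites[i].tag != tag)
			continue;
		/* Skip entries that would write outside the image. */
		if (sites[i].offset > image_len ||
		    sites[i].len > image_len - sites[i].offset)
			continue;
		memset(image + sites[i].offset, NOP_BYTE_SKETCH, sites[i].len);
		patched++;
	}
	return patched;
}
```

Grouping sites by tag lets one call neutralize every instance of a feature's
instructions when the CPU does not support it.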


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
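
The 2MB padding mentioned in revision 1.61 is ordinary round-up-to-alignment
arithmetic. A minimal C sketch, with a hypothetical macro and function name:

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE_SKETCH	(2ULL * 1024 * 1024)	/* 2MB */

/* Round an address up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
round_up_2mb_sketch(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE_SKETCH - 1) &
	    ~(LARGE_PAGE_SIZE_SKETCH - 1);
}
```

The gap between the real end of bss and the rounded-up value is the VA hole
the commit message refers to.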


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
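
The skip-if-unchanged logic from revision 1.45 can be modelled as below.
This is a sketch, not the kernel code: the struct, its counters, and the
function name are hypothetical, and the counters merely stand in for the
wrmsr instruction and the %fs reload that the real return path would issue.

```c
#include <stdint.h>

/* Hypothetical per-CPU state: cached FS.base plus counters for the model. */
struct fsbase_cpu_sketch {
	uint64_t cur_fsbase;	/* what the CPU currently has in FS.base */
	unsigned wrmsr_count;	/* times the expensive wrmsr path was taken */
	unsigned segload_count;	/* times %fs was reloaded to zero FS.base */
};

/* On return to userspace: only touch FS.base when the value differs. */
static void
fsbase_on_return_sketch(struct fsbase_cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->segload_count++;	/* cheaper: setting %fs zeros FS.base */
	else
		ci->wrmsr_count++;	/* general case: write the MSR */
	ci->cur_fsbase = pcb_fsbase;
}
```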


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
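
The fix in revision 1.31 follows from how x86 descriptor-table limits work:
the limit field holds the offset of the last valid byte, not the byte count.
A minimal C sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit for a descriptor table of table_bytes bytes.
 * The limit is the last valid byte offset, hence size - 1.
 */
static uint16_t
table_limit_sketch(size_t table_bytes)
{
	/* A 16-bit limit can describe between 1 and 65536 bytes. */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

Loading the size itself instead of size - 1 would declare one extra byte
past the end of the table as valid.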


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now in locore.S:
%r9 is not saved over function calls,
and moreover we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.139 28-Jul-2023 guenther

Add CODEPATCH_CODE() macro to simplify defining a symbol for a chunk
of code to use in codepatching. Use that for all the existing
codepatching snippets.

Similarly, add CODEPATCH_CODE_LEN() which is CODEPATCH_CODE() but also
provides a short variable holding the length of the codepatch snippet.
Use that for some snippets that will be used for retpoline replacement.

ok kettenis@ deraadt@


# 1.138 27-Jul-2023 guenther

Follow the lead of mips64 and make cpu_idle_cycle() just call the
indirect pointer itself and provide an initializer for that going
to the default "just enable interrupts and halt" path.

ok kettenis@


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc
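
To illustrate the register handling described in this entry, here is a minimal C sketch of the PKRU bit layout (two bits per key: access-disable at bit 2k, write-disable at bit 2k+1). The helper names are hypothetical and not taken from the kernel, which manages the register directly in assembly; only the bit positions are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* PKRU holds two bits per protection key: AD (bit 2k) and WD (bit 2k+1). */
#define PKRU_AD(key)	(1u << (2 * (key)))
#define PKRU_WD(key)	(1u << (2 * (key) + 1))

/* Hypothetical helper: value to force on exit so data reads via `key` fault. */
static uint32_t
pkru_inhibit_reads(uint32_t pkru, unsigned int key)
{
	return pkru | PKRU_AD(key);
}

/* Hypothetical helper: did userland change the register since the last exit? */
static bool
pkru_was_changed(uint32_t observed, uint32_t expected)
{
	return observed != expected;
}
```

Protection keys only gate data accesses, not instruction fetches, which is why a key with access disabled can serve as execute-only memory.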


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@
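
The fault-handler-plus-longjmp pattern in this entry can be sketched in userland C. This is only an analogy under stated assumptions: `onfault`, `simulate_fault()`, and `guarded_call()` are hypothetical stand-ins for pcb_onfault, the page fault path, and the wrapper around a runtime services call; the standard setjmp/longjmp stand in for the kernel's own.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for pcb_onfault: where to resume if a fault happens. */
static jmp_buf *onfault;

/* Stand-in for the page fault handler. */
static void
simulate_fault(void)
{
	if (onfault != NULL)
		longjmp(*onfault, 1);
	abort();	/* no handler installed: a real kernel would panic */
}

/* Run a firmware-style service; return 0 on success, -1 if it faulted. */
static int
guarded_call(void (*service)(void))
{
	jmp_buf env;

	if (setjmp(env) != 0) {
		/* Arrived here via longjmp from the fault handler. */
		onfault = NULL;
		return -1;
	}
	onfault = &env;
	service();
	onfault = NULL;
	return 0;
}

/* Two example services: one well behaved, one that faults. */
static void good_service(void) { }
static void buggy_service(void) { simulate_fault(); }
```

The handler is cleared on both the normal and the faulting path, so a later fault outside the guarded call is not silently swallowed.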


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 November 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
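
A rough C sketch of the two ideas in this entry: finding IPI targets by asking each CPU which pmap it has loaded, and comparing %cr3 values while ignoring the reuse bit. The struct layouts and function names are simplified assumptions; `CR3_REUSE_PCID` is assumed to be bit 63 of the value written to %cr3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pmap { int unused; };	/* opaque for this sketch */

struct cpu_info {
	const struct pmap *ci_proc_pmap;	/* pmap currently loaded on this CPU */
};

/* Build a bitmask of CPUs that need a TLB shootdown IPI for pm. */
static uint64_t
shootdown_targets(const struct cpu_info *cpus, size_t ncpus,
    const struct pmap *pm)
{
	uint64_t mask = 0;
	size_t i;

	for (i = 0; i < ncpus && i < 64; i++)
		if (cpus[i].ci_proc_pmap == pm)
			mask |= 1ULL << i;
	return mask;
}

/* Assumption: bit 63 of a %cr3 value is the "reuse PCID" hint. */
#define CR3_REUSE_PCID	(1ULL << 63)

/* True when both values name the same page tables, hint bit aside. */
static bool
cr3_same_tables(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID) == 0;
}
```

A real sender would also skip its own CPU and handle it locally; that detail is left out here.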


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
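
The four PCID roles listed in this entry can be sketched as follows. The enum names and numbering are assumptions made for illustration (the commit message only names the roles); the one architectural fact relied on is that the PCID occupies the low 12 bits of %cr3.

```c
#include <stdint.h>

/* Assumed names and numbering; the commit only says four roles exist. */
enum pcid {
	PCID_KERN = 0,		/* kernel threads */
	PCID_PROC = 1,		/* U+K tables of a normal process */
	PCID_PROC_USER = 2,	/* matching U-K tables under Meltdown mitigation */
	PCID_TEMP = 3		/* temporary mappings when poking other processes */
};

#define CR3_PCID_MASK	0xfffULL	/* PCID lives in the low 12 bits of %cr3 */

/* Combine a page-aligned top-level table address with a PCID. */
static uint64_t
cr3_with_pcid(uint64_t pml4_pa, enum pcid id)
{
	return (pml4_pa & ~CR3_PCID_MASK) | ((uint64_t)id & CR3_PCID_MASK);
}
```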


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
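
The flag-driven protocol in this entry can be modeled as a tiny state machine. This is a hedged sketch: `struct fpu_cpu` and both function names are hypothetical, and the counters merely stand in for the xsave/xrstor instructions the kernel would execute.

```c
#include <stdbool.h>

struct fpu_cpu {
	bool user_xstate_loaded;	/* is curproc's xstate live in the CPU? */
	int saves;			/* stands in for xsave to the PCB */
	int restores;			/* stands in for xrstor from the PCB */
};

/* Context switch, sendsig(), vmm, or kernel crypto wants the FPU. */
static void
fpu_kernel_claim(struct fpu_cpu *ci)
{
	if (ci->user_xstate_loaded) {
		ci->saves++;			/* save old thread's state */
		ci->user_xstate_loaded = false;
		/* the blank state would be loaded here */
	}
}

/* Return-to-userspace path: restore only if the state is not live. */
static void
fpu_return_to_user(struct fpu_cpu *ci)
{
	if (!ci->user_xstate_loaded) {
		ci->restores++;			/* may fault on altered state */
		ci->user_xstate_loaded = true;
	}
}
```

Because both paths key off a single per-CPU flag, repeated kernel claims or repeated returns to userspace cost nothing after the first.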


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
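
A minimal sketch of the cookie scheme described in this entry, assuming a simplified `struct sigctx` and hypothetical function names. The real sendsig()/sigreturn(2) code also checks the entry PC, which is not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the real signal context. */
struct sigctx {
	uint64_t sc_cookie;
	/* saved registers would follow */
};

/* sendsig() side: bind the context to its address and the process secret. */
static void
sigctx_seal(struct sigctx *sc, uint64_t proc_cookie)
{
	sc->sc_cookie = proc_cookie ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: accept the context once, then clear the cookie. */
static bool
sigctx_open(struct sigctx *sc, uint64_t proc_cookie)
{
	if (sc->sc_cookie != (proc_cookie ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;	/* prevent reuse of this sigcontext */
	return true;
}
```

Mixing the address into the cookie means a context copied elsewhere in memory no longer verifies, and clearing it on success means the same context cannot be replayed.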


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
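
The padding step in this entry amounts to rounding the end of the kernel up to the next 2MB boundary, so that whole large pages of the direct map can be made read-only. A small sketch, with a hypothetical helper name:

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round an address up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
round_up_2m(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}
```

The gap between the original end of bss and the rounded value is the VA hole the entry mentions.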


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
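
The skip-if-unchanged logic in this entry can be modelled in user-space C. This is only a sketch under stated assumptions, not the kernel's assembly: pcb_fsbase and ci_cur_fsbase are the names from the commit message, while struct cpu_model, struct pcb_model, load_fsbase() and the two counters are hypothetical and exist only to make the behaviour observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU state named in the log. */
struct cpu_model {
	uint64_t ci_cur_fsbase;	/* value the CPU currently holds in FS.base */
	unsigned wrmsr_count;	/* simulated wrmsr writes */
	unsigned segload_count;	/* simulated reloads of the %fs selector */
};

/* Hypothetical stand-in for the per-thread PCB field. */
struct pcb_model {
	uint64_t pcb_fsbase;	/* FS.base this thread wants in userspace */
};

/* Models the check made on the return-to-userspace path. */
static void
load_fsbase(struct cpu_model *ci, const struct pcb_model *pcb)
{
	assert(ci != NULL && pcb != NULL);

	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */

	if (pcb->pcb_fsbase == 0)
		ci->segload_count++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* stands in for the wrmsr path */

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

In this model, returning to userspace repeatedly on the same thread costs nothing after the first call, and a thread without TLS takes the cheaper selector-reload branch.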


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
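
As a small illustration of the off-by-one this entry fixes: the limit field of an x86 descriptor-table register holds the offset of the last valid byte, not the byte count. The helper below is a hypothetical sketch (table_limit() is not a kernel function) that shows the arithmetic only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Limit value for a descriptor table of table_bytes bytes:
 * the offset of the last valid byte, i.e. size - 1.
 */
static uint16_t
table_limit(size_t table_bytes)
{
	/* The limit field is 16 bits wide, so at most 64KB of table. */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

For example, a table of five 8-byte descriptors occupies 40 bytes and gets a limit of 39.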


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
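
A minimal user-space sketch of a single-pass wakeup over a TAILQ, using the <sys/queue.h> macros. Everything here is hypothetical (struct sleeper, wakeup_model(), the wchan field); it is not the MI scheduler code, and the log does not say where the old quadratic cost came from. The sketch only shows that a TAILQ lets each matching entry be unlinked in constant time during one walk of the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeping thread, queued on a wait channel. */
struct sleeper {
	const void *wchan;		/* what this sleeper is waiting on */
	int awake;			/* set once it has been woken */
	TAILQ_ENTRY(sleeper) link;	/* queue linkage */
};
TAILQ_HEAD(sleepq, sleeper);

/* Wake every sleeper waiting on wchan in one pass; returns how many woke. */
static int
wakeup_model(struct sleepq *q, const void *wchan)
{
	struct sleeper *s, *next;
	int n = 0;

	assert(q != NULL);
	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking s */
		if (s->wchan == wchan) {
			TAILQ_REMOVE(q, s, link);	/* O(1) unlink */
			s->awake = 1;
			n++;
		}
	}
	return n;
}
```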


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.137 25-Jul-2023 guenther

cpu_idle_{enter,leave} are no-ops on amd64 now, so just #define
away the calls

ok deraadt@ mpi@ miod@


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc
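
To make "inhibit data read against key1 memory" concrete: the x86 PKRU register carries two bits per protection key, an access-disable bit at position 2*key and a write-disable bit at position 2*key+1. The helpers below are a hypothetical sketch of that bit layout only. The exact PKRU value the kernel loads is not stated in this entry and is not implied here.

```c
#include <assert.h>
#include <stdint.h>

/* Access-disable (AD) bit for a protection key: bit 2*key of PKRU. */
static uint32_t
pkru_ad_bit(unsigned key)
{
	assert(key < 16);
	return 1u << (2 * key);
}

/* Write-disable (WD) bit for a protection key: bit 2*key+1 of PKRU. */
static uint32_t
pkru_wd_bit(unsigned key)
{
	assert(key < 16);
	return 1u << (2 * key + 1);
}

/* Would a data read of a page tagged with `key` be blocked by this PKRU? */
static int
pkru_blocks_read(uint32_t pkru, unsigned key)
{
	return (pkru & pkru_ad_bit(key)) != 0;
}
```

Protection keys only gate data accesses, not instruction fetches, which is why tagging PROT_EXEC pages with a read-inhibited key yields execute-only behaviour.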


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
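
The "am I loading the same cr3?" fix can be sketched as a masked comparison. This is a hypothetical C model of a test that the real code performs in assembly inside cpu_switchto(). CR3_REUSE_PCID_MODEL is defined locally as bit 63 (the no-flush bit when PCIDs are enabled), which is an assumption about the constant's value; same_cr3() and switch_cr3() are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value: bit 63 of a %cr3 write asks the CPU not to flush the PCID. */
#define CR3_REUSE_PCID_MODEL	(1ULL << 63)

/* Two %cr3 values name the same page tables if they differ only in that bit. */
static bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID_MODEL) == 0;
}

/* Returns the value now loaded; bumps *loads only when %cr3 must change. */
static uint64_t
switch_cr3(uint64_t cur, uint64_t next, unsigned *loads)
{
	assert(loads != NULL);
	if (same_cr3(cur, next))
		return cur;		/* skip the reload entirely */
	(*loads)++;
	return next;
}
```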


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when Meltdown mitigation
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
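
A rough sketch of how a PCID is combined with a page-table root to form a %cr3 value: with PCIDs enabled, the low 12 bits of %cr3 carry the PCID and the upper bits carry the page-aligned physical address of the top-level table. The enum names and their numeric assignments below are hypothetical and chosen only to mirror the four roles listed in this entry; make_cr3() is likewise an invented helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PCID assignments, one per role described in the log. */
enum pcid_model {
	PCID_M_KERN = 0,	/* kernel threads */
	PCID_M_PROC = 1,	/* U+K tables of normal processes */
	PCID_M_PROC_UONLY = 2,	/* matching U-K tables under Meltdown mitigation */
	PCID_M_TEMP = 3		/* temporary mappings when poking other processes */
};

/* Build a %cr3 value from a page-aligned table address and a PCID. */
static uint64_t
make_cr3(uint64_t pml4_pa, enum pcid_model pcid)
{
	assert((pml4_pa & 0xfff) == 0);	/* table root must be page aligned */
	return pml4_pa | (uint64_t)pcid;
}
```

Because TLB entries are tagged with the PCID that was active when they were created, entries belonging to one role are not used while another role's PCID is loaded, which is the extra separation the entry refers to.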


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
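
The flag-tracking scheme in this entry can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel code: struct xstate_model, struct fpu_cpu_model and both functions are hypothetical, and a single int stands in for the whole extended-state area. It shows only the two rules from the bullet list: kernel-side users save and blank the state if the flag is set, and the return-to-userspace path restores it if the flag is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real xstate ("extended state") save area. */
struct xstate_model {
	int regs;
};

/* Hypothetical per-CPU view: the tracking flag plus what the FPU holds. */
struct fpu_cpu_model {
	bool user_xstate_loaded;	/* is curproc's xstate in the CPU? */
	struct xstate_model hw;		/* simulated FPU register contents */
};

static const struct xstate_model blank_xstate = { 0 };

/* Context switch, sendsig(), vmm, kernel crypto: evict user state. */
static void
fpu_save_and_blank(struct fpu_cpu_model *ci, struct xstate_model *pcb_save)
{
	assert(ci != NULL && pcb_save != NULL);
	if (!ci->user_xstate_loaded)
		return;			/* nothing of the user's is loaded */
	*pcb_save = ci->hw;		/* save the old thread's state to the PCB */
	ci->user_xstate_loaded = false;
	ci->hw = blank_xstate;		/* load the blank state */
}

/* Return to userspace: make sure the thread's state is loaded. */
static void
fpu_return_to_user(struct fpu_cpu_model *ci, const struct xstate_model *pcb_save)
{
	assert(ci != NULL && pcb_save != NULL);
	if (ci->user_xstate_loaded)
		return;			/* already loaded, nothing to do */
	ci->user_xstate_loaded = true;
	ci->hw = *pcb_save;		/* restore the thread's state */
}
```

In this model there is no lazy trap: the state is either loaded or saved at well-defined points, which matches the entry's statement that the FPU #DNA trap can no longer happen.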


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
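
The cookie scheme described in this entry can be modelled in a few lines of
userland C. This is an illustrative sketch only, not the kernel code: the
struct layout, the per-process secret parameter, and the sketch_* helper
names are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the signal context; only the cookie matters. */
struct sketch_sigcontext {
	uint64_t sc_cookie;	/* filled in when the signal is delivered */
	uint64_t sc_rip;	/* placeholder for the saved register state */
};

/* Delivery side: store (per-process secret ^ address of the context). */
static void
sketch_sendsig(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * Return side: accept the context only if the cookie matches the secret
 * combined with the context's own address, then clear the cookie so the
 * same context cannot be replayed.  Returns 0 if accepted, -1 if rejected.
 */
static int
sketch_sigreturn(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address of the context is mixed into the cookie, a context copied
to a different location fails the check, and clearing the cookie after a
successful return rejects a second use of the same context.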


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
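
The "skip the setting if the CPU has the right value already" logic from this
entry can be sketched as a small simulation. The struct, its counters, and
the sketch_* names are hypothetical; real code would execute wrmsr or reload
%fs where this model merely counts the operation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU state: cached FS.base plus counters for the model. */
struct sketch_cpu {
	uint64_t cur_fsbase;	/* value FS.base currently holds on this CPU */
	unsigned wrmsr_count;	/* how often the (expensive) wrmsr path ran */
	unsigned seg_reloads;	/* how often the cheap "%fs reload" path ran */
};

/* Called on return to userspace with the thread's wanted FS.base. */
static void
sketch_set_fsbase(struct sketch_cpu *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->seg_reloads++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* nonzero base needs the MSR write */
	ci->cur_fsbase = pcb_fsbase;
}
```

Tracking the current value per CPU rather than per thread is what lets two
threads with the same base switch back and forth without touching the MSR.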


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
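
The limit field of a descriptor-table register holds the offset of the last
valid byte, which is why it is the table size minus one. A minimal sketch,
assuming 8-byte descriptors and a hypothetical sketch_region_descriptor type
that mirrors the 10-byte lgdt operand used in 64-bit mode:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the lgdt operand: 16-bit limit, 64-bit base. */
struct sketch_region_descriptor {
	uint16_t rd_limit;	/* offset of the LAST valid byte of the table */
	uint64_t rd_base;	/* linear address of the table */
} __attribute__((packed));

/* Build the operand for a table of nentries 8-byte descriptors. */
static struct sketch_region_descriptor
sketch_make_gdtr(const uint64_t *table, unsigned nentries)
{
	struct sketch_region_descriptor rd;

	/* size - 1, not size: the limit is inclusive */
	rd.rd_limit = (uint16_t)(nentries * sizeof(uint64_t) - 1);
	rd.rd_base = (uint64_t)(uintptr_t)table;
	return rd;
}
```

With eight descriptors the table occupies bytes 0 through 63, so the limit is
63; using 64 would declare one byte past the table as valid.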


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.136 10-Jul-2023 guenther

Enable Indirect Branch Tracking for amd64 userland, using XSAVES/XRSTORS
to save/restore the state and enabling it at exec-time (and for
signal handling) if the PS_NOBTCFI flag isn't set.

Note: this changes the format of the sc_fpstate data in the signal
context to possibly be in compressed format: starting now we just
guarantee that that state is in a format understood by the XRSTOR
instruction of the system that is being executed on.

At this time, passing sigreturn a corrupt sc_fpstate now results
in the process exiting with no attempt to fix it up or send a
T_PROTFLT trap. That may change.

prodding by deraadt@
issues with my original signal handling design identified by kettenis@

lots of base and ports preparation for this by deraadt@ and the
libressl and ports teams

ok deraadt@ kettenis@


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with a endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
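
The "am I loading the same cr3?" fix in this entry amounts to comparing the
two values with the reuse bit masked out. A minimal sketch, assuming the
reuse flag is bit 63 of the value written to %cr3 (as it is for the no-flush
bit when PCIDs are enabled); the SKETCH_ macro and the function name are
hypothetical, not the kernel's identifiers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the "reuse PCID / don't flush" flag: bit 63. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/*
 * True if both values name the same page table and PCID, ignoring whether
 * either one carries the reuse flag.
 */
static bool
sketch_same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

Without the mask, a value that differs only in the flag would look like a
different address space and trigger an unnecessary %cr3 reload.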


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise the
proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@
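
The optimization described in rev 1.109 can be illustrated with a minimal
sketch in C. It is not the locore.S code: the struct and function names are
hypothetical, and the "CPU" is only a model that counts flushes, on the
assumption that every real %cr3 load flushes non-global TLB entries.

```c
#include <stdint.h>

/* Hypothetical model of a CPU: the loaded %cr3 value and a flush counter. */
struct sketch_cpu {
	uint64_t cr3;
	unsigned tlb_flushes;
};

/* Load a new page-table base only if it differs from the current one. */
static void
sketch_switch_cr3(struct sketch_cpu *ci, uint64_t new_cr3)
{
	if (ci->cr3 == new_cr3)
		return;		/* same tables: keep the TLB contents */
	ci->cr3 = new_cr3;	/* stands in for "movq %rax,%cr3" */
	ci->tlb_flushes++;	/* a real reload flushes non-global entries */
}
```

Skipping the write is only safe when no stale entries can remain for that
address space, which is presumably why the log message ties the change to
the more paranoid shootdowns in pmap.c.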


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
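
The flag-based tracking described in rev 1.98 can be sketched as a small
state machine in C. This is only an illustration under stated assumptions:
the type and function names are hypothetical, the "registers" are a toy
byte array, and the real code uses xsave/xrstor and must also handle xrstor
faulting, which is omitted here.

```c
#include <stdbool.h>
#include <string.h>

/* Toy-sized stand-in for the extended (FPU/SSE/AVX) state. */
struct sketch_xstate {
	unsigned char bytes[64];
};

/* Per-thread save area, playing the role of the PCB. */
struct sketch_thread {
	struct sketch_xstate saved;
};

/* Per-CPU view: the live registers plus the "is it loaded?" flag. */
struct sketch_cpu_fpu {
	struct sketch_xstate regs;
	bool usr_loaded;	/* is curproc's xstate in the registers? */
};

/*
 * Context switch, sendsig(), vmm, kernel crypto: if the flag is set,
 * save the thread's state, clear the flag, and load the blank state.
 */
static void
sketch_fpu_kernel_enter(struct sketch_cpu_fpu *ci, struct sketch_thread *cur)
{
	if (ci->usr_loaded) {
		cur->saved = ci->regs;
		ci->usr_loaded = false;
		memset(&ci->regs, 0, sizeof(ci->regs));
	}
}

/* Return to userspace: if the flag is clear, set it and restore. */
static void
sketch_fpu_user_return(struct sketch_cpu_fpu *ci,
    const struct sketch_thread *cur)
{
	if (!ci->usr_loaded) {
		ci->regs = cur->saved;
		ci->usr_loaded = true;
	}
}
```

Because the blank state is loaded whenever the kernel takes over, no other
thread's register contents are left in the CPU for speculation to observe,
which matches the third motivation listed in the log message.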


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
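
The cookie scheme in rev 1.78 can be sketched in C as follows. This is a
simplified illustration, not the kernel code: the struct layout, the
function names, and the way the per-process secret is passed in are all
assumptions, and the check that sigreturn(2) was entered from the exact
sigtramp PC is left out.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the real sigcontext. */
struct sketch_sigcontext {
	uint64_t sc_rip;	/* saved user PC, unused by the check */
	uint64_t sc_cookie;	/* per-process secret ^ address of this struct */
};

/* sendsig() side: bind the context to this process and this address. */
static void
sketch_set_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: accept the context once, then clear the cookie. */
static bool
sketch_check_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;	/* prevent reuse of the same sigcontext */
	return true;
}
```

Mixing the address into the cookie means a context copied elsewhere in
memory fails the check, and clearing it after use means the same context
cannot be replayed.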


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
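
As an illustration of the test that rev 1.72 makes consistent, the
privilege level of a selector lives in its low two bits, so masking with
the RPL mask and comparing against the user level tells whether a saved
%cs came from userspace. The macro and function names below are
hypothetical stand-ins, and the selector values in the note are only
examples.

```c
#include <stdint.h>
#include <stdbool.h>

#define SKETCH_SEL_RPL	0x3	/* low two bits: requested privilege level */
#define SKETCH_SEL_UPL	0x3	/* ring 3 (user) */

/* True if the saved code selector belongs to user mode. */
static bool
sketch_from_user(uint16_t cs)
{
	return (cs & SKETCH_SEL_RPL) == SKETCH_SEL_UPL;
}
```

For example, a selector of 0x2b has both low bits set and is treated as
user mode, while 0x08 has them clear and is treated as kernel mode.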


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
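
The padding arithmetic behind rev 1.61 can be sketched in C. The names are
hypothetical and the addresses in the note are made up; the only assumption
taken from the log message is that the kernel area is padded out to a 2MB
boundary, which leaves a hole after the end of bss.

```c
#include <stdint.h>

#define SKETCH_LARGE_PAGE	(2ULL * 1024 * 1024)	/* 2MB large page */

/* Round addr up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
sketch_round_2mb(uint64_t addr)
{
	return (addr + SKETCH_LARGE_PAGE - 1) & ~(SKETCH_LARGE_PAGE - 1);
}

/* Size of the hole between the end of bss and the padded end of kernel. */
static uint64_t
sketch_hole_size(uint64_t end_of_bss)
{
	return sketch_round_2mb(end_of_bss) - end_of_bss;
}
```

With a bss that ends at 0x3ff000, the padded end is 0x400000 and the hole
is one 4KB page; symbols loaded at "end of bss" would land inside it.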


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
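
The fix in rev 1.31 follows from how the x86 descriptor-table limit is
defined: it is the offset of the last valid byte, so it is one less than
the table size. A minimal sketch in C, with a hypothetical struct and
helper name, and relying on the GCC/clang packed attribute to get the
10-byte layout that lgdt expects in 64-bit mode:

```c
#include <stdint.h>
#include <stddef.h>

/* Pseudo-descriptor operand for lgdt in 64-bit mode (10 bytes). */
struct sketch_region_descriptor {
	uint16_t rd_limit;	/* offset of the last valid byte */
	uint64_t rd_base;	/* linear base address of the table */
} __attribute__((packed));

/* Build the operand for a table of table_bytes bytes (must be nonzero). */
static struct sketch_region_descriptor
sketch_make_gdtr(const void *table, size_t table_bytes)
{
	struct sketch_region_descriptor rd;

	rd.rd_limit = (uint16_t)(table_bytes - 1);	/* size - 1, not size */
	rd.rd_base = (uint64_t)(uintptr_t)table;
	return rd;
}
```

For a table of eight 8-byte descriptors the limit is 63; setting it to 64
would make one byte past the table addressable as part of a descriptor.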


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.135 05-Jul-2023 anton

The hypercall page populated with instructions by the hypervisor is not IBT
compatible due to lack of endbr64. Replace the indirect call with a new
hv_hypercall_trampoline() routine which jumps to the hypercall page without any
indirection.

Allows me to boot OpenBSD using Hyper-V on Windows 11 again.

ok guenther@


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@
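
The bail-out pattern described above can be imitated in userspace C with setjmp/longjmp. This is only a rough sketch: `onfault`, `guarded_call()`, `simulated_page_fault()` and the two sample services are hypothetical names invented for illustration, and the "fault" is an ordinary function call rather than a real trap.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Stands in for pcb_onfault: where to resume if a fault happens. */
static jmp_buf *onfault;

/* Stands in for the trap handler noticing a fault during a guarded call. */
void
simulated_page_fault(void)
{
	if (onfault == NULL)
		abort();		/* no handler installed: treat as a panic */
	longjmp(*onfault, 1);
}

/* Run fn() with a fault handler installed; 0 on success, -1 on fault. */
int
guarded_call(void (*fn)(void))
{
	jmp_buf env;

	if (setjmp(env) != 0) {
		/* We arrive here via longjmp() from the fault path. */
		onfault = NULL;
		return -1;
	}
	onfault = &env;
	fn();
	onfault = NULL;
	return 0;
}

/* Sample "runtime services": one well behaved, one that faults. */
void
ok_service(void)
{
}

void
faulting_service(void)
{
	simulated_page_fault();
}
```

With these definitions, `guarded_call(ok_service)` returns 0 and `guarded_call(faulting_service)` returns -1 instead of crashing the caller.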


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@
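
For readers unfamiliar with the idea of routing every flag update through an atomic read-modify-write, here is a minimal userspace sketch using C11 atomics. It is not the kernel's atomic_{set,clear}bits_int() implementation; `sketch_setbits()` and `sketch_clearbits()` are hypothetical stand-ins.

```c
#include <stdatomic.h>

/* Atomically OR bits into a shared flags word. */
void
sketch_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits in a shared flags word. */
void
sketch_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

Because each update is a single atomic operation, two CPUs changing different bits of the same word cannot lose each other's update, which is what lets a member be honestly annotated as "atomic".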


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This keeps the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
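
The "am I loading the same cr3?" fix can be pictured with a small C sketch. The real test lives in assembly in cpu_switchto(); `same_cr3()` is a hypothetical helper, and the bit position assumed for CR3_REUSE_PCID (bit 63, the x86 "do not flush" bit) is stated here as an assumption rather than taken from this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: bit 63 of a value written to %cr3 asks the CPU
 * to keep the TLB entries tagged with that PCID. */
#define CR3_REUSE_PCID	(1ULL << 63)

/* True if both values name the same page table and PCID, ignoring
 * whether either one carries the reuse bit. */
bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID) == 0;
}
```

Without the mask, a value that differs only in the reuse bit would look like a different address space and force a needless %cr3 reload.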


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
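
As a rough illustration of the four-PCID scheme, the sketch below composes a %cr3 value from a page-table physical address and a PCID. The enumerator names and numbers are assumptions chosen for illustration, as are `make_cr3()` and `CR3_PCID_MASK`; the only architectural fact relied on is that the PCID occupies the low 12 bits of %cr3 when PCIDs are enabled.

```c
#include <stdint.h>

/* Illustrative numbering for the four uses named in the commit message. */
enum pcid_sketch {
	PCID_KERN = 0,		/* kernel threads */
	PCID_PROC = 1,		/* U+K tables of normal processes */
	PCID_PROC_INTEL = 2,	/* matching U-K tables under Meltdown mitigation */
	PCID_TEMP = 3		/* temporary mappings when poking other processes */
};

#define CR3_PCID_MASK	0xfffULL	/* low 12 bits of %cr3 carry the PCID */

/* Combine a page-aligned top-level table address with a PCID. */
uint64_t
make_cr3(uint64_t pml4_pa, enum pcid_sketch pcid)
{
	return (pml4_pa & ~CR3_PCID_MASK) | (uint64_t)pcid;
}
```

Tagging TLB entries this way means a switch between, say, a process's U+K and U-K tables does not have to throw away the other context's translations.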


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
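
The flag tracking described in the two bullet points above can be modelled in a few lines of C. This is a toy model under stated assumptions, not the kernel code: `struct cpu_sketch` and both functions are hypothetical, and the counters merely stand in for the actual save and restore instructions.

```c
#include <stdbool.h>

struct cpu_sketch {
	bool user_xstate_loaded;	/* is curproc's xstate live in the CPU? */
	int saves;			/* times state was saved to the PCB */
	int restores;			/* times state was restored for userspace */
};

/* Context switch, sendsig(), vmm, or in-kernel CPU crypto is about to run. */
void
fpu_kernel_enter(struct cpu_sketch *ci)
{
	if (ci->user_xstate_loaded) {
		ci->saves++;			/* save the thread's state */
		ci->user_xstate_loaded = false;	/* blank state is loaded now */
	}
}

/* About to return to userspace. */
void
fpu_user_return(struct cpu_sketch *ci)
{
	if (!ci->user_xstate_loaded) {
		ci->user_xstate_loaded = true;
		ci->restores++;			/* restore the thread's state */
	}
}
```

Repeated calls on either side are no-ops once the flag is in the right state, so the cost is paid at most once per transition.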


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
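
The cookie check can be sketched in portable C as follows. The structure and function names are hypothetical and the real sigcontext carries far more state; the sketch only shows the XOR binding between a per-process secret and the context's own address, and the clear-on-use step.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down signal context: only the cookie matters here. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* sendsig() side: bind the per-process secret to this context's address. */
void
cookie_store(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: accept the context once, then clear the cookie. */
bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

A context copied to a different address fails the check because the address is part of the cookie, and a context that has already been consumed fails because its cookie was cleared.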


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
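
For context, a privilege-level test on a segment selector looks roughly like this in C. The constant values are assumptions based on the x86 selector layout (the low two bits hold the requested privilege level), and `selector_is_user()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_RPL	3	/* mask: low two bits of a selector */
#define SEL_UPL	3	/* ring 3, userspace */

/* True if the selector's requested privilege level is ring 3. */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL) == SEL_UPL;
}
```

Using the mask name rather than a privilege-level constant on the left side keeps the intent clear even though both happen to equal 3.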


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
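
The padding arithmetic behind "pad the end of the kernel area to 2MB" is a plain round-up to a power-of-two boundary. A minimal sketch, with `round_up_2mb()` and `LARGE_PAGE_SIZE` as hypothetical names:

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB large page */

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}
```

The gap between the real end of bss and the rounded-up address is the VA hole the commit message refers to.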


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
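
The skip-if-unchanged logic for FS.base can be modelled as below. This is a sketch under stated assumptions: `struct cpu_fs_sketch` and `load_fsbase()` are hypothetical, and the counters stand in for the wrmsr instruction and the %fs selector reload that the real return-to-userspace path would execute.

```c
#include <stdint.h>

struct cpu_fs_sketch {
	uint64_t cur_fsbase;	/* value FS.base currently holds on this CPU */
	int msr_writes;		/* stands in for wrmsr */
	int selector_loads;	/* stands in for reloading %fs */
};

/* Make the CPU's FS.base match the thread's, doing as little as possible. */
void
load_fsbase(struct cpu_fs_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->selector_loads++;	/* setting %fs zeroes FS.base cheaply */
	else
		ci->msr_writes++;
	ci->cur_fsbase = pcb_fsbase;
}
```

Two threads of the same non-TLS process switching back and forth therefore never touch the MSR at all.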


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
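
As a reminder of why the "- 1" matters: the limit field of a descriptor-table
register holds the offset of the last valid byte, not the table size. A
minimal sketch, assuming a hypothetical region_sketch struct that mirrors the
limit/base pair but not the packed hardware layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit/base pair; the real lgdt operand is packed. */
struct region_sketch {
	uint16_t rd_limit;	/* offset of the last valid byte */
	uint64_t rd_base;	/* linear address of the table */
};

static void
set_region(struct region_sketch *rd, const void *base, size_t size)
{
	/* The 16-bit limit can describe between 1 and 65536 bytes. */
	assert(size >= 1 && size <= 65536);

	rd->rd_limit = (uint16_t)(size - 1);
	rd->rd_base = (uint64_t)(uintptr_t)base;
}
```

For a table of eight 8-byte descriptors, set_region() stores a limit of 63.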


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.134 17-Apr-2023 deraadt

For future userland IBT, the sigcode needs to start with an endbr64.
This is simpler than clearing the cet_u bits in the kernel.
ok guenther, kettenis


# 1.133 17-Apr-2023 deraadt

IDTVEC_NOALIGN() was the incorrect way to create a label in two places,
use GENTRY() instead. Also add two endbr64 which cannot be supplied by
macros
ok guenther


Revision tags: OPENBSD_7_3_BASE
# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This keeps the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
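
A rough C sketch of the two ideas in this commit, for illustration only. The
struct names are hypothetical; only ci_proc_pmap and CR3_REUSE_PCID are named
in the message. The bit position used for CR3_REUSE_PCID_SKETCH (bit 63, the
"do not flush" bit of %cr3 when PCIDs are enabled) is an assumption about how
the real constant is defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value: bit 63 of %cr3 asks the CPU to keep the PCID's TLB entries. */
#define CR3_REUSE_PCID_SKETCH	(1ULL << 63)

struct pmap_sketch {
	uint64_t pm_pdirpa;	/* physical address of the top-level table */
};

struct cpu_info_sketch {
	const struct pmap_sketch *ci_proc_pmap;	/* pmap loaded on this CPU */
};

/*
 * Build the set of CPUs needing a TLB invalidation IPI for pm by asking
 * each CPU which pmap it has loaded, instead of reading a per-pmap bitset.
 */
static uint64_t
shootdown_targets(const struct cpu_info_sketch *cpus, int ncpus,
    const struct pmap_sketch *pm)
{
	uint64_t mask = 0;
	int i;

	assert(cpus != NULL && ncpus >= 0 && ncpus <= 64);
	for (i = 0; i < ncpus; i++)
		if (cpus[i].ci_proc_pmap == pm)
			mask |= 1ULL << i;
	return mask;
}

/* "Am I loading the same cr3?", ignoring the reuse-PCID hint bit. */
static bool
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID_SKETCH) == 0;
}
```

With this shape the context-switch path only stores a pointer in its own
cpu_info, and the cost of finding IPI targets moves to the (rarer) shootdown.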


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single a process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to be a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
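
The flag tracking described in the two bullets above can be modelled in C as
follows. This is a simplified sketch, not the kernel code: the xstate is
reduced to a single int, and fpu_cpu, fpu_thread, and FPU_BLANK are
hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FPU_BLANK	0	/* stands in for the blank xstate */

struct fpu_thread {
	int pcb_xstate;		/* xstate saved in the thread's PCB */
};

struct fpu_cpu {
	bool xstate_loaded;	/* is curproc's xstate loaded in the CPU? */
	int regs;		/* stands in for the CPU's FPU registers */
};

/*
 * Used by context switch, sendsig(), vmm, and in-kernel CPU crypto:
 * if the flag is set, save the thread's state, clear the flag, and
 * load the blank state.
 */
static void
fpu_kernel_enter(struct fpu_cpu *ci, struct fpu_thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (ci->xstate_loaded) {
		cur->pcb_xstate = ci->regs;
		ci->xstate_loaded = false;
		ci->regs = FPU_BLANK;
	}
}

/* On return to userspace: if the flag is clear, set it and restore. */
static void
fpu_return_to_user(struct fpu_cpu *ci, const struct fpu_thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = true;
		ci->regs = cur->pcb_xstate;
	}
}
```

Because the state is restored eagerly on the way back to userspace, nothing
in this model depends on a trap firing at the first FPU instruction, which is
what lets the %cr0 TS flag stay clear.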


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
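
The cookie scheme described above can be sketched in C as follows. This is
only an illustration under stated assumptions: the struct, the function
names, and the 64-bit per-process secret are hypothetical stand-ins, not the
kernel's actual sendsig()/sigreturn(2) code.

```c
#include <stdint.h>

/* Hypothetical stand-in; the real sigcontext has many more fields. */
struct sigcontext_sketch {
	uint64_t sc_rip;
	uint64_t sc_cookie;
};

/* sendsig() side: bind the frame to this process and to this address. */
static void
cookie_store(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: returns 1 if the frame is accepted, 0 if rejected.
 * The cookie is cleared on success so the same frame cannot be reused.
 */
static int
cookie_check_and_clear(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return 0;
	sc->sc_cookie = 0;
	return 1;
}
```

Because the address of the sigcontext is mixed into the cookie, a frame
copied to another location fails the check even if the attacker copied a
valid cookie along with it.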


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
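
The padding arithmetic implied above can be sketched as below. The helper
names and the constant name are assumptions made for illustration; the real
work is done in locore.S and the linker script.

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* one 2MB large page */

/* Round addr up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the VA hole between the end of bss and the padded end. */
static uint64_t
pad_hole(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

Ending the kernel area on a 2MB boundary means no large page of the direct
map straddles kernel and non-kernel memory, so write permission can be
dropped on whole pages.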


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
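
A rough C model of the skip-if-unchanged logic follows. The struct and the
counters are hypothetical; the counters only stand in for the two ways of
loading FS.base (a wrmsr, or a reload of %fs which zeros it), so the sketch
can run in userland.

```c
#include <stdint.h>

struct cpu_sketch {
	uint64_t ci_cur_fsbase;	/* value the CPU currently has loaded */
	unsigned wrmsr_count;	/* simulated wrmsr writes of FS.base */
	unsigned segload_count;	/* simulated reloads of %fs */
};

/* Called on return to userspace with the thread's pcb_fsbase. */
static void
restore_fsbase(struct cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->ci_cur_fsbase == pcb_fsbase)
		return;			/* CPU already has it: skip */
	if (pcb_fsbase == 0)
		ci->segload_count++;	/* setting %fs zeros FS.base */
	else
		ci->wrmsr_count++;	/* stands in for the wrmsr */
	ci->ci_cur_fsbase = pcb_fsbase;
}
```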


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
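
For context: the limit field of a descriptor-table register holds the offset
of the last valid byte, not the byte count, hence the "- 1". A minimal
sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Limit value for a descriptor table occupying nbytes bytes.
 * Assumes 1 <= nbytes <= 65536; the limit is the offset of the
 * last valid byte.
 */
static uint16_t
table_limit(uint32_t nbytes)
{
	return (uint16_t)(nbytes - 1);
}
```

With the limit set to the size itself, the CPU would treat one byte past the
end of the table as valid.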


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernels faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.132 20-Jan-2023 deraadt

On cpu with the PKU feature, prot=PROT_EXEC pages now create pte which
contain PG_XO, which is PKU key1. On every exit from kernel to userland,
force the PKU register to inhibit data read against key1 memory. On
(some) traps into the kernel if the PKU register is changed, abort the
process (processes have no reason to change the PKU register). This
provides us with viable xonly functionality on most modern intel & AMD
cpus. I started with a xsave-based diff from dv@, but discovered the
fpu save/restore logic wasn't a good fit and went to direct register management.
Disabled on HV (vm) systems until we know they handle PKU correctly.
ok kettenis, dv, guenther, etc
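
For readers unfamiliar with the PKU register (PKRU): each protection key k
owns two bits, access-disable at bit 2k and write-disable at bit 2k+1, and
the keys restrict data accesses only, not instruction fetches. The sketch
below shows the bit arithmetic for "inhibit data read against key1". The
macro and function names are made up, and the exact value the kernel loads
is not stated in this log.

```c
#include <stdint.h>

#define PKRU_AD(key)	(1U << (2 * (key)))	/* access-disable bit */
#define PKRU_WD(key)	(1U << (2 * (key) + 1))	/* write-disable bit */

/* Return pkru with data reads denied for the given key. */
static uint32_t
pkru_deny_read(uint32_t pkru, unsigned key)
{
	return pkru | PKRU_AD(key);
}

/* Trap-entry check: has userland altered the register? */
static int
pkru_changed(uint32_t observed, uint32_t expected)
{
	return observed != expected;
}
```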


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@
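
The bail-out pattern can be modelled in userland C with the standard
setjmp/longjmp pair. Everything here is a simplified stand-in: "onfault"
plays the role of pcb_onfault, fault() plays the role of the page-fault
path, and the two service functions simulate a buggy and a well-behaved
firmware call.

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf *onfault;	/* hypothetical stand-in for pcb_onfault */

/* Simulated fault path: jump to the recovery point if one is set. */
static void
fault(void)
{
	if (onfault != NULL)
		longjmp(*onfault, 1);
	/* a real kernel would panic here; the sketch simply returns */
}

/* Simulated firmware call that touches memory it never told us about. */
static void
buggy_service(void)
{
	fault();
}

/* Simulated firmware call that behaves. */
static void
ok_service(void)
{
}

/* Returns 0 on success, -1 if the call faulted. */
static int
call_guarded(void (*service)(void))
{
	jmp_buf env;

	if (setjmp(env) != 0) {
		onfault = NULL;		/* arrived here via longjmp */
		return -1;
	}
	onfault = &env;
	service();
	onfault = NULL;
	return 0;
}
```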


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
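
The corrected "same cr3?" test can be sketched as below. The sketch assumes
the reuse flag is bit 63 of the value written to %cr3; the macro and
function names are illustrative only.

```c
#include <stdint.h>

/* Assumed here: the "reuse PCID, don't flush" flag is bit 63. */
#define CR3_REUSE_PCID_SKETCH	(1ULL << 63)

/* Nonzero if cur and next name the same page tables and PCID. */
static int
same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~CR3_REUSE_PCID_SKETCH) == 0;
}
```

Masking out the flag matters because it is a request to the CPU about
flushing, not part of the identity of the address space.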


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
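
The flag tracking described above can be sketched in C. This is only an
illustration under stated assumptions, not the actual kernel code: the struct
names, the 64-byte state size, and the function names are all hypothetical,
and the "registers" are modeled as a plain buffer.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the extended (FPU/SSE/AVX) state area. */
struct xstate_sketch {
	unsigned char bytes[64];
};

/* Hypothetical per-CPU data: the tracking flag plus modeled registers. */
struct cpu_sketch {
	bool user_xstate_loaded;	/* is curproc's xstate in the CPU? */
	struct xstate_sketch regs;	/* models the CPU's live registers */
};

/* Hypothetical per-thread data: the saved copy kept in the PCB. */
struct thread_sketch {
	struct xstate_sketch pcb_save;
};

/*
 * Context switch, sendsig(), vmm, and in-kernel CPU crypto: if the flag
 * is set, save the old thread's state, clear the flag, load blank state.
 */
static void
sketch_fpu_kernel_enter(struct cpu_sketch *ci, struct thread_sketch *old)
{
	if (ci->user_xstate_loaded) {
		old->pcb_save = ci->regs;
		ci->user_xstate_loaded = false;
		memset(&ci->regs, 0, sizeof(ci->regs));
	}
}

/* Return to userspace: if the flag is clear, set it and restore state. */
static void
sketch_fpu_user_return(struct cpu_sketch *ci, const struct thread_sketch *cur)
{
	if (!ci->user_xstate_loaded) {
		ci->regs = cur->pcb_save;
		ci->user_xstate_loaded = true;
	}
}
```

Because the restore only happens when the flag is clear, a syscall that never
touches the FPU returns to userspace without any save or restore at all.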


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
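
A minimal sketch of the cookie scheme described above, assuming a simplified
context structure. The struct and function names here are hypothetical, and
the separate check that the syscall came from the exact sigtramp PC is not
modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the real struct sigcontext. */
struct sigcontext_sketch {
	uint64_t sc_cookie;	/* per-process secret ^ address of this struct */
	/* saved registers would follow here */
};

/* sendsig() side: bind the context to this process and this address. */
static void
sketch_sendsig(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	sc->sc_cookie = proc_cookie ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: verify the cookie, then clear it to prevent reuse. */
static bool
sketch_sigreturn(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	if (sc->sc_cookie != (proc_cookie ^ (uint64_t)(uintptr_t)sc))
		return false;	/* forged, relocated, or already consumed */
	sc->sc_cookie = 0;
	return true;
}
```

Mixing the address into the cookie means a context copied to another location
fails verification, and clearing it means the same context cannot be replayed.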


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
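
For illustration, the test in question has roughly the following shape in C.
The SKETCH_ names are hypothetical stand-ins for the kernel's own macros, and
the values assume the usual x86 layout where the low two bits of a selector
hold the requested privilege level.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SEL_RPL	0x3	/* mask: low two bits of a selector */
#define SKETCH_SEL_KPL	0	/* kernel privilege level (ring 0) */
#define SKETCH_SEL_UPL	3	/* user privilege level (ring 3) */

/* Did the saved %cs selector come from userspace? */
static bool
sketch_from_userspace(uint16_t cs)
{
	return (cs & SKETCH_SEL_RPL) == SKETCH_SEL_UPL;
}
```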


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
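
The padding arithmetic can be sketched as a round-up to the next 2MB boundary.
This is an illustration only: the macro and function names are hypothetical,
and the real work is done by the linker script and locore, not by C code.

```c
#include <stdint.h>

#define SKETCH_2MB	(2ULL * 1024 * 1024)

/* Round an address up to the next 2MB (large page) boundary. */
static uint64_t
sketch_round_up_2mb(uint64_t addr)
{
	return (addr + SKETCH_2MB - 1) & ~(SKETCH_2MB - 1);
}
```

The gap between the end of bss and the rounded-up address is the VA hole the
message refers to.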


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.131 01-Dec-2022 guenther

_C_LABEL() is no longer useful in the "everything is ELF" world.
Start eliminating it.

ok mpi@ mlarkin@ krw@


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
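
The "same cr3?" test with the reuse bit ignored can be sketched as follows.
This is a hedged illustration, not the assembly from cpu_switchto(): the
SKETCH_ macro is a hypothetical stand-in for CR3_REUSE_PCID, assumed here to
be bit 63 of the value written to %cr3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: bit 63 of a %cr3 write asks the CPU to keep the PCID's TLB entries. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/* Do two %cr3 values name the same page table and PCID? */
static bool
sketch_same_cr3(uint64_t cur, uint64_t next)
{
	return ((cur ^ next) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

Comparing without the mask would treat two otherwise identical values as
different whenever only one of them carries the reuse bit, forcing a needless
reload.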


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the Meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to use a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
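
A toy model of the flag tracking described in the two bullet points above,
assuming a per-CPU "xstate loaded" flag and a simulated register file. All
names and sizes are hypothetical; the real code uses fxsave/xsave and
fxrstor/xrstor and must also handle xrstor faulting.

```c
#include <stdbool.h>
#include <string.h>

#define XSTATE_SIZE 16		/* toy size; real xstate areas are much larger */

struct thread_sketch {
	unsigned char pcb_xstate[XSTATE_SIZE];	/* saved copy in the PCB */
};

struct cpu_fpu_sketch {
	unsigned char regs[XSTATE_SIZE];	/* simulated FPU registers */
	bool xstate_loaded;			/* is curproc's state in regs? */
};

/* Context switch, sendsig(), vmm, or in-kernel CPU crypto. */
void
fpu_save_and_reset(struct cpu_fpu_sketch *ci, struct thread_sketch *cur)
{
	if (!ci->xstate_loaded)
		return;
	memcpy(cur->pcb_xstate, ci->regs, XSTATE_SIZE);	/* save to the PCB */
	ci->xstate_loaded = false;
	memset(ci->regs, 0, XSTATE_SIZE);		/* load the blank state */
}

/* On the way back to userspace. */
void
fpu_return_to_user(struct cpu_fpu_sketch *ci, const struct thread_sketch *cur)
{
	if (ci->xstate_loaded)
		return;
	ci->xstate_loaded = true;
	memcpy(ci->regs, cur->pcb_xstate, XSTATE_SIZE);	/* restore the state */
}
```

Because the registers always hold either the current thread's state or the
blank state, no trap is needed to detect first use of the FPU.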


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
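
A minimal sketch of the cookie part of this scheme, assuming a 64-bit
per-process secret. The structure and function names are hypothetical, and
the check of the syscall's PC against the sigtramp address is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

struct sigcontext_sketch {
	uint64_t sc_cookie;
	/* saved registers would follow in a real sigcontext */
};

/* sendsig() side: bind the context to this process and this address. */
void
cookie_store(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: verify, then clear so the context can't be reused. */
bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Mixing in the sigcontext's own address means a context copied to another
location fails the check, and clearing the cookie rejects a second use.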


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
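
For illustration, a sketch of such a privilege-level test in C. The helper
name is hypothetical and the constant values are assumptions based on the
x86 selector format, where the low two bits hold the privilege level.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: the low two bits of a selector hold its privilege level. */
#define SEL_RPL	3	/* mask for the privilege-level bits */
#define SEL_KPL	0	/* kernel privilege level */
#define SEL_UPL	3	/* user privilege level */

/* Hypothetical helper: did this selector come from userspace? */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL) == SEL_UPL;
}
```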


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
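
The padding arithmetic can be illustrated with a small C helper that rounds
an address up to the next 2MB boundary. The names are hypothetical; the
actual padding is arranged at link time and in locore, not by this function.

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}
```

Ending the kernel area on a 2MB boundary lets whole large pages of the
direct map have their permissions changed without affecting memory beyond it.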


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
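
A sketch of why the limit is size - 1: the limit field of the operand given
to lgdt is the offset of the last valid byte, not a byte count. The structure
and function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the operand handed to lgdt in 64-bit mode. */
struct gdt_pointer_sketch {
	uint16_t limit;		/* offset of the last valid byte */
	uint64_t base;		/* linear address of the table */
} __attribute__((packed));

struct gdt_pointer_sketch
make_gdt_pointer(const void *table, size_t size_in_bytes)
{
	struct gdt_pointer_sketch p;

	p.limit = (uint16_t)(size_in_bytes - 1);	/* size - 1, not size */
	p.base = (uint64_t)(uintptr_t)table;
	return p;
}
```

For a table of three 8-byte descriptors this yields a limit of 23, so the
byte at offset 24, just past the table, is not treated as part of it.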


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.130 29-Nov-2022 guenther

Move the generic variable definitions from the ASM at the top of
locore.S to be in C in cpu.c, machdep.c, pmap.c, or bus_space.c for
better typing/debug info. Delete REALBASEMEM, REALEXTMEM, and
biosextmem as unused/ignored.

ok mpi@ krw@ mlarkin@


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from being at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
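
The "am I loading the same cr3?" fix can be pictured with a small C sketch. This is not the assembly in cpu_switchto(); same_cr3() is a hypothetical helper, and the value of CR3_REUSE_PCID is assumed here to be bit 63 (the no-flush hint on a %cr3 load).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to be bit 63, the "do not flush" hint on a %cr3 load. */
#define CR3_REUSE_PCID	(1ULL << 63)

/* True when both values name the same page table and PCID. */
static bool
same_cr3(uint64_t loaded, uint64_t wanted)
{
	/* The hint bit says nothing about which table is loaded: mask it. */
	return ((loaded ^ wanted) & ~CR3_REUSE_PCID) == 0;
}
```

Without the mask, two values that differ only in the hint bit would compare unequal and force a needless reload.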


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
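
The flag-based tracking described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: struct fpu_cpu_model and both functions are hypothetical, and counters stand in for the real xsave/xrstor work.

```c
#include <stdbool.h>

/* Hypothetical model: one CPU, one flag, counters instead of xsave/xrstor. */
struct fpu_cpu_model {
	bool user_xstate_loaded;   /* is curproc's xstate live in the CPU? */
	int saves;                 /* times state was saved to the PCB */
	int restores;              /* times state was restored on return */
};

/* Kernel paths that need the FPU (context switch, sendsig, vmm, crypto). */
static void
fpu_kernel_claim(struct fpu_cpu_model *ci)
{
	if (ci->user_xstate_loaded) {
		ci->saves++;                     /* save old thread's state */
		ci->user_xstate_loaded = false;  /* clear the flag */
		/* the real code would load the blank state here */
	}
}

/* Return-to-userspace path. */
static void
fpu_user_return(struct fpu_cpu_model *ci)
{
	if (!ci->user_xstate_loaded) {
		ci->restores++;                  /* restore the thread's state */
		ci->user_xstate_loaded = true;   /* set the flag */
	}
}
```

In this model, repeated kernel entries and exits that never touch the FPU cost no saves or restores, since both functions return early when the flag already has the wanted value.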


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
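
The cookie part of this mitigation can be sketched in C as follows. The struct and function names are hypothetical, the per-process value is passed in as a plain integer, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down sigcontext: only the cookie field matters here. */
struct sigcontext_model {
	uintptr_t sc_cookie;
};

/* sendsig() side: bind the frame to this process and this address. */
static void
cookie_store(struct sigcontext_model *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* sigreturn() side: verify, then clear so the frame cannot be reused. */
static bool
cookie_check_and_clear(struct sigcontext_model *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address of the sigcontext is mixed into the cookie, a frame copied to another location fails the check, and clearing the cookie makes a second sigreturn on the same frame fail as well.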


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
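
For readers unfamiliar with the test in question, a minimal C sketch follows. The macro values are assumed to match the usual x86 definitions (the low two bits of a selector hold the requested privilege level); selector_is_user() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: the low two bits of a selector hold the RPL. */
#define SEL_KPL	0	/* kernel privilege level */
#define SEL_UPL	3	/* user privilege level */
#define SEL_RPL	3	/* mask for the requester privilege level bits */

/* True when the selector was loaded with user privilege. */
static bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL) == SEL_UPL;
}
```

Using the named mask rather than a literal 3 makes it clear which bits of the saved %cs are being tested.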


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
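
The padding arithmetic behind this change can be sketched in C. The function names and the LARGE_PAGE macro are hypothetical; only the 2MB figure comes from the commit message.

```c
#include <stdint.h>

#define LARGE_PAGE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round addr up to the next 2MB boundary (no change if already aligned). */
static uint64_t
round_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE - 1) & ~(LARGE_PAGE - 1);
}

/* Size of the hole between the end of bss and the padded end. */
static uint64_t
pad_hole(uint64_t end_of_bss)
{
	return round_2mb(end_of_bss) - end_of_bss;
}
```

The hole returned by pad_hole() corresponds to the VA gap the message describes between the end of bss and the padded end of the kernel area.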


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
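
The skip-if-unchanged logic described in rev 1.45 can be sketched in C roughly as follows. This is a user-space illustration only, not the locore.S assembly: the struct layouts, the counters, and the fake_* helpers are hypothetical stand-ins, and only the names pcb_fsbase and ci_cur_fsbase come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down versions of the structures named in the log. */
struct pcb {
	uint64_t pcb_fsbase;		/* FS.base the thread wants */
};

struct cpu_info {
	uint64_t ci_cur_fsbase;		/* FS.base currently in the CPU */
	unsigned wrmsr_count;		/* sketch only: counts simulated wrmsr */
	unsigned selector_loads;	/* sketch only: counts simulated %fs loads */
};

/* Simulated hardware operations; the real ones are privileged instructions. */
static void
fake_wrmsr_fsbase(struct cpu_info *ci, uint64_t value)
{
	ci->wrmsr_count++;
	ci->ci_cur_fsbase = value;
}

static void
fake_load_fs_selector(struct cpu_info *ci)
{
	/* Loading the %fs selector zeros FS.base as a side effect. */
	ci->selector_loads++;
	ci->ci_cur_fsbase = 0;
}

/* On return to user space, touch FS.base only if it actually differs. */
static void
return_to_user(struct cpu_info *ci, const struct pcb *pcb)
{
	assert(ci != NULL && pcb != NULL);
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;				/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		fake_load_fs_selector(ci);	/* cheaper way to reach zero */
	else
		fake_wrmsr_fsbase(ci, pcb->pcb_fsbase);
}
```

Calling return_to_user() twice in a row for the same thread performs the simulated wrmsr only once, which is the saving the commit describes.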


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
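
As a reminder of why rev 1.31 subtracts one: the limit field of an x86 descriptor-table register holds the offset of the last valid byte, not the byte count. A minimal sketch in C, where gdt_limit() is a hypothetical helper and not a kernel function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the 16-bit limit for a descriptor table
 * occupying table_bytes bytes. The limit is the offset of the last
 * valid byte, so it is the size minus one.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	/* A 16-bit limit can describe at most 65536 bytes. */
	assert(table_bytes > 0 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

For example, a table of three 8-byte descriptors (24 bytes) gets a limit of 23; using 24 would tell the CPU that one extra byte past the table is valid.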


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.129 04-Nov-2022 kettenis

EFI firmware has bugs which may mean that calling EFI runtime services will
fault because it does memory accesses outside of the regions it told us to
map. Try to mitigate this by installing a fault handler (using the
pcb_onfault mechanism) and bail out using longjmp(9) if we encounter a
page fault while executing an EFI runtime services call.

Since some firmware bugs result in us executing code that isn't mapped,
make kpageflttrap() handle execution faults as well as data faults.

ok guenther@


Revision tags: OPENBSD_7_2_BASE
# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
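
The "am I loading the same cr3?" fix in rev 1.125 amounts to masking one bit before comparing. A hedged sketch in C: the macro below is a local stand-in that assumes the no-flush flag is bit 63 of the value written to %cr3 (as the Intel architecture defines it), and same_cr3() is a hypothetical helper rather than the actual cpu_switchto() code.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for CR3_REUSE_PCID; assumed to be bit 63. */
#define SKETCH_CR3_REUSE_PCID	(1ULL << 63)

/*
 * Return nonzero if the two values name the same page tables and PCID,
 * ignoring whether either one carries the "reuse PCID" flag.
 */
static int
same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~SKETCH_CR3_REUSE_PCID) == 0;
}
```

With the flag masked out, two kernel threads whose cr3 values differ only in that bit compare as equal, so the reload (and its cost) can be skipped.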


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
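
The flag tracking at the heart of rev 1.98 can be modelled as a two-state machine. The sketch below is a user-space illustration in C under stated assumptions: struct cpu_sketch, its counters, and both function names are hypothetical, and the actual save/restore instructions are replaced by counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU state for the sketch. */
struct cpu_sketch {
	int xstate_loaded;	/* is curproc's xstate live in the CPU? */
	unsigned saves;		/* simulated xsave of the thread's state */
	unsigned restores;	/* simulated xrstor of the thread's state */
};

/*
 * Called by anything in the kernel that is about to clobber or hide the
 * FPU registers (context switch, sendsig, vmm, in-kernel crypto).
 */
static void
kernel_claims_fpu(struct cpu_sketch *ci)
{
	assert(ci != NULL);
	if (ci->xstate_loaded) {
		ci->saves++;		/* save old thread's state to its PCB */
		ci->xstate_loaded = 0;	/* blank state is now loaded instead */
	}
}

/* Called on the way back to userspace. */
static void
return_to_userspace(struct cpu_sketch *ci)
{
	assert(ci != NULL);
	if (!ci->xstate_loaded) {
		ci->restores++;		/* restore the thread's own state */
		ci->xstate_loaded = 1;
	}
}
```

A syscall that never touches the FPU leaves the flag set, so neither a save nor a restore happens on that round trip; the cost is paid only when the kernel actually claims the registers.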


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
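
The cookie scheme from rev 1.78 can be sketched in C as below. This is an illustration of the idea only: struct sigcontext_sketch, both functions, and the way the per-process secret is passed in are hypothetical, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down signal context holding only the cookie. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* Signal delivery: bind the context to this process and this address. */
static void
sketch_sendsig(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * Signal return: accept the context only if the cookie matches, then
 * clear it so the same context cannot be replayed. Returns 0 on
 * success and -1 on a bad or reused context.
 */
static int
sketch_sigreturn(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address of the context is mixed into the cookie, a context copied to another location fails the check, and clearing the cookie makes a second return through the same context fail as well.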


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
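
The "skip the setting if the CPU has the right value already" logic can be sketched as a toy model. This is Python standing in for the return-to-userspace assembly, under the assumptions stated in the commit message (reloading %fs zeroes FS.base and is cheaper than wrmsr); `Cpu`, `restore_fsbase`, and the two counters are hypothetical names, while `cur_fsbase` mirrors ci_cur_fsbase and the `pcb_fsbase` argument mirrors the PCB member.

```python
class Cpu:
    """Toy model of the per-CPU state named in the commit (ci_cur_fsbase)."""

    def __init__(self):
        self.cur_fsbase = 0       # value currently held in FS.base
        self.wrmsr_count = 0      # expensive MSR writes performed
        self.fs_reload_count = 0  # cheap %fs selector reloads performed


def restore_fsbase(cpu, pcb_fsbase):
    """Make FS.base match the thread's value before returning to userspace."""
    if cpu.cur_fsbase == pcb_fsbase:
        return  # CPU already has the right value: nothing to do
    if pcb_fsbase == 0:
        cpu.fs_reload_count += 1  # reloading %fs zeroes FS.base cheaply
    else:
        cpu.wrmsr_count += 1      # otherwise pay for the wrmsr
    cpu.cur_fsbase = pcb_fsbase
```

In this model a run of non-threaded processes (all with a zero FS.base) never touches the MSR at all, which is the optimization the message describes.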


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
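
For context on the "size - 1" rule: the limit field of an x86 descriptor-table register holds the offset of the last valid byte, so a table of N bytes has a limit of N - 1. The sketch below packs the 10-byte operand that lgdt takes in 64-bit mode (16-bit limit followed by a 64-bit base). It is an illustration in Python, not the kernel's code; the function name and the example base address are hypothetical.

```python
import struct


def gdt_pseudo_descriptor(base, table_bytes):
    """Pack the lgdt operand for a GDT of `table_bytes` bytes at `base`."""
    if not 1 <= table_bytes <= 0x10000:
        raise ValueError("table size must fit a 16-bit limit")
    limit = table_bytes - 1  # offset of the last valid byte, not the size
    # "<HQ": little-endian, no padding: 2-byte limit then 8-byte base.
    return struct.pack("<HQ", limit, base)
```

For a table of five 8-byte descriptors, `gdt_pseudo_descriptor(0xFFFF800000001000, 40)` yields a limit of 39; passing the raw size instead would expose one byte past the end of the table.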


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.128 07-Aug-2022 guenther

Start to add annotations to the cpu_info members, doing I/a/o for
immutable/atomic/owned ala <sys/proc.h>. Move CPUF_USERSEGS and
CPUF_USERXSTATE, which really are private to the CPU, into a new
ci_pflags and rename s/CPUF_/CPUPF_/. Make all (remaining) ci_flags
alterations via atomic_{set,clear}bits_int(), so its annotation
isn't a lie. Delete ci_info member as unused all the way from
rev 1.1

ok jsg@ mlarkin@


Revision tags: OPENBSD_7_1_BASE
# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents that the ret
instruction is at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
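
Both ideas in this commit can be modelled briefly. The sketch below is a hedged Python illustration, not the cpu_switchto() assembly: it assumes CR3_REUSE_PCID is bit 63 of %cr3 (the architectural "do not flush" bit when PCIDs are enabled), and the helper names `same_cr3` and `tlb_shootdown_targets` as well as the dictionary layout are hypothetical. Only the `proc_pmap` key mirrors the ci_proc_pmap member named above.

```python
CR3_REUSE_PCID = 1 << 63  # assumption: the no-flush bit of %cr3


def same_cr3(current, new):
    """Compare two %cr3 values while ignoring the CR3_REUSE_PCID bit."""
    return ((current ^ new) & ~CR3_REUSE_PCID) == 0


def tlb_shootdown_targets(cpus, pmap, self_index):
    """Indexes of the other CPUs that currently have `pmap` loaded.

    `cpus` is a list of dicts with a "proc_pmap" entry, standing in for
    each cpu_info's ci_proc_pmap; no per-pmap CPU bitset is consulted.
    """
    return [i for i, ci in enumerate(cpus)
            if i != self_index and ci["proc_pmap"] is pmap]
```

With the flag bit masked out, two values that name the same page table and PCID compare equal even if only one of them carries the reuse bit, so the switch between such threads can skip the reload.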


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
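
The flag-based tracking described in the two bullet points of this message can be written out as a small state machine. This is a hedged Python model only: `Cpu`, `save_and_blank`, `return_to_userspace`, the string-valued "register file", and the dictionary standing in for the PCB are all hypothetical, and the xrstor fault handling mentioned in the message is left out.

```python
BLANK = "blank"  # stands in for the initial (blank) xstate


class Cpu:
    """Toy CPU: one flag plus an opaque value standing in for FPU registers."""

    def __init__(self):
        self.user_xstate_loaded = False  # is curproc's xstate in the CPU?
        self.regs = BLANK


def save_and_blank(cpu, pcb):
    """Context switch, sendsig(), vmm, kernel crypto: check the flag first."""
    if cpu.user_xstate_loaded:
        pcb["xstate"] = cpu.regs        # save the old thread's state to its PCB
        cpu.user_xstate_loaded = False  # clear the flag
        cpu.regs = BLANK                # load the blank state


def return_to_userspace(cpu, pcb):
    """If the flag is clear, set it and restore the thread's state."""
    if not cpu.user_xstate_loaded:
        cpu.regs = pcb["xstate"]
        cpu.user_xstate_loaded = True
```

In this model the state is either in the CPU with the flag set or in the PCB with the flag clear, so no trap is needed to discover whose state the registers hold, which is the property the message relies on to retire the #DNA path.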


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
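
A minimal sketch of the cookie scheme described above, assuming a simplified frame layout. The struct, field, and function names are hypothetical, and the PC check against the sigtramp address is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the real sigcontext. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;
	/* ... saved registers would live here ... */
};

/* sendsig() side: bind the frame to this process and to this address. */
void
cookie_store(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* sigreturn() side: verify, then clear so the frame cannot be reused. */
bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Mixing in the frame's own address means a valid sigcontext copied to a different location no longer verifies, and clearing the cookie stops a second sigreturn on the same frame.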


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
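
As an illustration of the test this commit refers to: on x86 the low two bits of a segment selector hold the requested privilege level, so masking with 3 and comparing against 3 tells whether a saved %cs came from userspace. The macro and function names below are illustrative, not the kernel's own.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_RPL_MASK	0x3	/* low two bits: requested privilege level */
#define SEL_USER_PL	3	/* ring 3 */

/* True if a saved %cs selector says the trap came from userspace. */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL_MASK) == SEL_USER_PL;
}
```

Using one named mask everywhere avoids a mix of literal 3s and privilege-level constants that happen to have the same value.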


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
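
The padding arithmetic behind this change can be sketched as below. The helper names are hypothetical and the addresses in any usage are made up; the only fixed fact is the 2MB large-page size.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE_SIZE	UINT64_C(0x200000)	/* 2MB */

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	assert(addr <= UINT64_MAX - (LARGE_PAGE_SIZE - 1));	/* no overflow */
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the VA hole between the end of bss and the padded kernel end. */
uint64_t
pad_hole_size(uint64_t bss_end)
{
	return round_up_2mb(bss_end) - bss_end;
}
```

With the kernel area ending on a 2MB boundary, the direct map can cover it with whole large pages and drop the W permission on exactly those pages.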


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
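
A rough C model of the return-to-userspace decision described above. The field names pcb_fsbase and ci_cur_fsbase come from the commit message; everything else is hypothetical, and the two counters merely stand in for the privileged instructions (wrmsr and a reload of %fs), which cannot run in a userland sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pcb_sketch {
	uint64_t pcb_fsbase;		/* value this thread wants in FS.base */
};

struct cpu_info_sketch {
	uint64_t ci_cur_fsbase;		/* value the CPU currently holds */
	unsigned int n_wrmsr;		/* stand-in: wrmsr executions */
	unsigned int n_fs_reload;	/* stand-in: reloads of %fs */
};

void
fsbase_on_user_return(struct cpu_info_sketch *ci, const struct pcb_sketch *pcb)
{
	assert(ci != NULL && pcb != NULL);
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		ci->n_fs_reload++;	/* setting %fs zeroes FS.base cheaply */
	else
		ci->n_wrmsr++;		/* general case: write the MSR */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```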


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
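
The off-by-one this commit fixes comes from how the descriptor-table limit is defined: it is the offset of the last valid byte, not the byte count. A small sketch, assuming plain 8-byte descriptors only (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* A legacy x86 segment descriptor is 8 bytes. */
struct gdt_entry_sketch { uint64_t raw; };

/*
 * The limit loaded with lgdt is the offset of the last valid byte
 * of the table, so it is the table size minus one.
 */
uint16_t
gdt_limit(unsigned int nentries)
{
	assert(nentries >= 1 && nentries <= 8192);
	return (uint16_t)(nentries * sizeof(struct gdt_entry_sketch) - 1);
}
```

Loading the raw size instead would make the CPU treat one extra byte past the table as part of it.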


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.127 31-Dec-2021 jsg

specifed -> specified


Revision tags: OPENBSD_7_0_BASE
# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from being at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 november 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
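
The corrected "am I loading the same cr3?" test can be sketched as follows. This assumes the reuse flag is bit 63 of the value written to %cr3 (the architectural no-invalidate bit when PCIDs are enabled); the macro and function names here are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: bit 63 asks the CPU to keep TLB entries for the PCID. */
#define CR3_REUSE_PCID_SKETCH	(UINT64_C(1) << 63)

/*
 * Compare two %cr3 values while ignoring the reuse flag, which is a
 * hint for the load itself and not part of the page table identity.
 */
bool
cr3_same_tables(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~CR3_REUSE_PCID_SKETCH) == 0;
}
```

Without the mask, two values that name the same page tables and PCID but differ only in the flag would look different and force a needless reload.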


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown is
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
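
The flag-based tracking described in this commit can be illustrated with a
minimal C sketch. This is not the kernel's code: the struct names, the
`user_state_loaded` flag, and the plain struct copies standing in for
xsave/xrstor are all hypothetical, chosen only to show the two rules from the
list above (save and blank on kernel-side use, restore on return to userspace).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the extended (FPU/SSE/AVX) register state. */
struct sketch_xstate {
	unsigned char bytes[64];
};

/* Hypothetical per-thread save area (the PCB in the commit's terms). */
struct sketch_thread {
	struct sketch_xstate saved;
};

/* Hypothetical per-CPU view: the "live" registers plus the tracking flag. */
struct sketch_cpu {
	struct sketch_xstate regs;
	bool user_state_loaded;
};

/* The blank state loaded whenever the kernel takes over the registers. */
static const struct sketch_xstate sketch_blank = { { 0 } };

/*
 * Context switch, signal delivery, vmm, or in-kernel crypto: if the
 * current thread's state is live, save it, clear the flag, load blank.
 */
static void
sketch_fpu_kernel_enter(struct sketch_cpu *ci, struct sketch_thread *t)
{
	if (ci->user_state_loaded) {
		t->saved = ci->regs;		/* stands in for xsave */
		ci->user_state_loaded = false;
		ci->regs = sketch_blank;	/* stands in for loading blank state */
	}
}

/* Return to userspace: if the flag is clear, set it and restore the state. */
static void
sketch_fpu_return_to_user(struct sketch_cpu *ci, const struct sketch_thread *t)
{
	if (!ci->user_state_loaded) {
		ci->regs = t->saved;		/* stands in for xrstor */
		ci->user_state_loaded = true;
	}
}
```

The sketch leaves out the xrstor fault handling mentioned above; in the real
change a faulting restore resets the state and raises a signal.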


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
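
The cookie scheme described above can be sketched in a few lines of C. This is
an illustration under stated assumptions, not the kernel's implementation: the
struct, the function names, and the `proc_secret` parameter are hypothetical,
and the check that the syscall was entered from the exact sigtramp PC is
omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified signal context; real ones also hold registers. */
struct sketch_sigcontext {
	uint64_t sc_cookie;
};

/*
 * Delivery side: store (per-process secret ^ address of this sigcontext),
 * which ties the context both to the process and to where it lives.
 */
static void
sketch_sendsig(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * Return side: accept the context only if the cookie matches for this
 * address, then clear it so the same context cannot be replayed.
 */
static bool
sketch_sigreturn(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address is folded into the cookie, a context copied to a different
location fails verification even when the attacker can read the original.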


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.126 04-Sep-2021 bluhm

To mitigate against spectre attacks, AMD processors without the
IBRS feature need an lfence instruction after every near ret. Place
them after all functions in the kernel which are implemented in
assembler. Change the retguard macro so that the end of the lfence
instruction is 16-byte aligned now. This prevents the ret
instruction from landing at the end of a 32-byte boundary. The latter would
cause a performance impact on certain Intel processors which have
a microcode update to mitigate the jump conditional code erratum.
See software techniques for managing speculation on AMD processors
revision 9.17.20 mitigation G-5.
See Intel mitigations for jump conditional code erratum revision
1.0 November 2019 2.4 software guidance and optimization methods.
OK deraadt@ mortimer@


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
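
The "am I loading the same cr3?" test described above can be sketched in C
roughly as follows. This is only an illustration, not the assembly in
cpu_switchto(): the helper name same_cr3() is hypothetical, and the only
architectural fact relied on is that bit 63 of a value written to %cr3
(CR3_REUSE_PCID) is a no-flush hint rather than part of the page table
address or PCID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of a %cr3 value asks the CPU to keep TLB entries for that PCID. */
#define CR3_REUSE_PCID	(1ULL << 63)

/*
 * Hypothetical helper: true if loading new_cr3 would select the same
 * page tables and PCID that cur_cr3 already describes.  The reuse hint
 * bit is masked off so that it cannot make identical values look
 * different.
 */
static bool
same_cr3(uint64_t cur_cr3, uint64_t new_cr3)
{
	return ((cur_cr3 ^ new_cr3) & ~CR3_REUSE_PCID) == 0;
}
```

With a check of this shape, a switch between two kernel threads whose
values differ only in the hint bit can skip the %cr3 reload.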


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
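
The flag tracking described in this commit can be modelled in a few lines
of C. This is a sketch under stated assumptions, not the kernel's code:
struct fpu_model, its fields, and both function names are hypothetical, and
the counters merely stand in for the real xsave/xrstor instructions.

```c
#include <stdbool.h>

/* Hypothetical per-CPU model; names are illustrative only. */
struct fpu_model {
	bool user_xstate_loaded;	/* is curproc's xstate live in the CPU? */
	int saves;			/* simulated xsave count */
	int restores;			/* simulated xrstor count */
};

/* Context switch, sendsig(), vmm, or kernel crypto is about to run. */
static void
fpu_kernel_enter(struct fpu_model *m)
{
	if (m->user_xstate_loaded) {
		m->saves++;			/* save thread state to the PCB */
		m->user_xstate_loaded = false;	/* then load the blank state */
	}
}

/* About to return to userspace. */
static void
fpu_return_to_user(struct fpu_model *m)
{
	if (!m->user_xstate_loaded) {
		m->user_xstate_loaded = true;
		m->restores++;			/* restore the thread's state */
	}
}
```

In this model repeated kernel entries or repeated returns cost nothing
extra: the save and the restore each happen at most once per transition of
the flag.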


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
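
The cookie part of this mitigation can be sketched in C as below. It is a
simplified model with assumptions: struct sigcontext_model and both function
names are hypothetical, the per-process value is passed in as a plain
integer, and the check that the syscall came from the exact sigtramp PC is
not modelled at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down sigcontext; saved registers omitted. */
struct sigcontext_model {
	uint64_t sc_cookie;
};

/* sendsig() side: store (per-process value ^ address of the sigcontext). */
static void
cookie_store(struct sigcontext_model *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: verify the cookie, then clear it to prevent reuse. */
static bool
cookie_check_and_clear(struct sigcontext_model *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address of the sigcontext is mixed into the cookie, a frame
copied to a different location fails the check, and clearing the cookie
makes a second sigreturn on the same frame fail as well.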


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
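
Padding the end of the kernel area to 2MB amounts to rounding an address up
to the next large-page boundary. A minimal C sketch of that arithmetic
follows; the macro and function names are hypothetical, and the real padding
is arranged at link and boot time rather than by a helper like this.

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round addr up to the next multiple of align (must be a power of two). */
static uint64_t
round_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

For example, round_up(end_of_bss, LARGE_PAGE_SIZE) - end_of_bss would give
the size of the VA hole the commit message mentions, with end_of_bss standing
in for whatever address the kernel image actually ends at.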


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
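
The skip-if-already-correct logic can be modelled in C as follows. The field
names pcb_fsbase and ci_cur_fsbase come from the commit message; everything
else (the struct names, the function name, and the counters standing in for
wrmsr and for reloading %fs) is a hypothetical illustration.

```c
#include <stdint.h>

/* Hypothetical model of the per-CPU state. */
struct cpu_model {
	uint64_t ci_cur_fsbase;	/* FS.base value the CPU currently holds */
	int wrmsr_count;	/* simulated (expensive) MSR writes */
	int fs_reload_count;	/* simulated (cheaper) %fs reloads */
};

/* Hypothetical model of the per-thread state. */
struct pcb_model {
	uint64_t pcb_fsbase;	/* FS.base value this thread wants */
};

/* Called on the way back to userspace. */
static void
fsbase_on_user_return(struct cpu_model *ci, const struct pcb_model *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		ci->fs_reload_count++;	/* setting %fs zeros FS.base */
	else
		ci->wrmsr_count++;	/* otherwise write the MSR */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

In this model a non-threaded process without TLS never triggers a wrmsr, and
a threaded one pays for it only when the CPU's cached value differs.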


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
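
The reason for the "- 1" is that the limit field of a descriptor table
register holds the offset of the last valid byte, not the byte count. A
small C sketch of the computation, with gdt_limit() as a hypothetical helper
name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit for a descriptor table that occupies
 * table_bytes bytes.  The limit is the offset of the last valid byte,
 * so it is one less than the size.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	assert(table_bytes > 0 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

Using the plain size instead would describe a table one byte longer than it
really is.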


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.125 18-Jun-2021 guenther

The pmap needs to know which CPUs to send IPIs when TLB entries
need to be invalidated. Instead of keeping a bitset of CPUs in
each pmap, have each cpu_info track which pmap it has loaded: replace
pmap->pm_cpus with cpu_info->ci_proc_pmap. This reduces the atomic
operations (and cache thrashing) and simplifies cpu_switchto()

Also, fix a defect in cpu_switchto()'s "am I loading the same cr3?"
test: ignore the CR3_REUSE_PCID bit when checking that. This makes
switching between kernel threads slightly less costly.

over a week in snaps with no complaints
looks ok to mlarkin@ kettenis@ mpi@
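
The corrected "am I loading the same cr3?" test from 1.125 can be sketched in C as a comparison that masks out the reuse bit. This is only an illustration: the real test is assembly inside cpu_switchto(), and the DEMO_ names and the bit position (bit 63 of the value loaded into %cr3 when PCIDs are enabled) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the "keep this PCID's TLB entries" bit in a %cr3 load. */
#define DEMO_CR3_REUSE_PCID	(1ULL << 63)

/*
 * Return true when both values name the same page table and PCID,
 * ignoring whether either one carries the reuse bit.
 */
static bool
demo_same_cr3(uint64_t loaded, uint64_t wanted)
{
	return ((loaded ^ wanted) & ~DEMO_CR3_REUSE_PCID) == 0;
}
```

With a check of this shape, two values that differ only in the reuse bit count as equal, so the reload (and its cost) can be skipped.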


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

branches: 1.122.2;
Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

branches: 1.120.4;
Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
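
The four-PCID scheme described in 1.110 can be pictured as tagging the low 12 bits of the %cr3 value with a small identifier. The sketch below is an assumption-laden illustration: the enum names and the numeric values are made up for the example and are not claimed to match the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative PCID assignments, one per use listed in the commit message. */
enum demo_pcid {
	DEMO_PCID_KERN = 0,	/* kernel threads */
	DEMO_PCID_PROC = 1,	/* U+K tables of a normal process */
	DEMO_PCID_PROC_USER = 2,	/* matching U-K tables (Meltdown mitigation) */
	DEMO_PCID_TEMP = 3	/* temporary mappings when poking other processes */
};

/* Combine a page-aligned top-level table address with a PCID. */
static uint64_t
demo_make_cr3(uint64_t pml4_pa, enum demo_pcid pcid)
{
	return (pml4_pa & ~0xfffULL) | (uint64_t)pcid;
}
```

Because the PCID occupies the bits that page alignment leaves free, the same physical table address yields a different %cr3 value for each role.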


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
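
The flag-based tracking described in 1.98 can be sketched as two small C routines. This is a simplified model under stated assumptions: the struct and function names are hypothetical, an int stands in for the whole xsave area, and the real code also has to deal with xrstor faulting.

```c
#include <stdbool.h>

struct demo_xstate {
	int regs;			/* stand-in for the xsave area */
};

struct demo_thread {
	struct demo_xstate pcb_saved;	/* copy kept in the PCB */
};

struct demo_fpu_cpu {
	bool user_xstate_loaded;	/* is curproc's xstate live in the CPU? */
	struct demo_xstate hw;		/* stand-in for the FPU registers */
};

/* Run before a context switch or before the kernel itself uses the FPU. */
static void
demo_fpu_kernel_begin(struct demo_fpu_cpu *ci, struct demo_thread *cur)
{
	if (ci->user_xstate_loaded) {
		cur->pcb_saved = ci->hw;	/* save the thread's state */
		ci->user_xstate_loaded = false;
		ci->hw.regs = 0;		/* load the blank state */
	}
}

/* Run on the way back to userspace. */
static void
demo_fpu_return_to_user(struct demo_fpu_cpu *ci, const struct demo_thread *cur)
{
	if (!ci->user_xstate_loaded) {
		ci->hw = cur->pcb_saved;	/* restore the thread's state */
		ci->user_xstate_loaded = true;
	}
}
```

The point of the model is that a single per-CPU flag decides both when to save and when to restore, so no trap or IPI is needed to find out where the state lives.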


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
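
The cookie scheme in 1.78 can be illustrated with a small C model: the cookie binds a per-process secret to the address of one particular sigcontext, and is cleared once it has been accepted. The names below are hypothetical, and the model ignores that the real sigcontext lives in user memory and that sigreturn(2) also checks the calling PC.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_sigcontext {
	uint64_t sc_cookie;	/* hypothetical field name */
	/* saved registers would follow here */
};

/* What sendsig() does in this model: stamp secret ^ address. */
static void
demo_stamp(struct demo_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* What sigreturn() does in this model: verify, then clear to block reuse. */
static bool
demo_verify(struct demo_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

In this model a sigcontext copied to another address fails verification, and so does a second attempt to use one that was already accepted.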


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
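
The padding in 1.61 amounts to rounding the end of the kernel up to the next 2MB boundary so that whole large pages can have their permissions changed. A minimal C sketch of that arithmetic, with hypothetical names:

```c
#include <stdint.h>

#define DEMO_LARGE_PAGE	(2ULL * 1024 * 1024)	/* 2MB large-page size */

/* Round an address up to the next 2MB boundary (no change if aligned). */
static uint64_t
demo_round_2mb(uint64_t addr)
{
	return (addr + DEMO_LARGE_PAGE - 1) & ~(DEMO_LARGE_PAGE - 1);
}
```

The gap between the true end of bss and the rounded-up address is the "VA hole" the commit message refers to.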


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is joint work with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
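
The "size - 1" rule above follows from how x86 defines the descriptor-table
limit: it is the offset of the last valid byte, not the byte count. A minimal
sketch in C, assuming a hypothetical 8-byte gdt_entry type and a
region_descriptor layout (16-bit limit, 64-bit base) chosen for illustration;
neither is copied from the kernel headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte descriptor standing in for a real GDT entry. */
struct gdt_entry {
	uint64_t raw;
};

/* Assumed lgdt operand layout in long mode: 16-bit limit, 64-bit base. */
struct region_descriptor {
	uint16_t rd_limit;
	uint64_t rd_base;
} __attribute__((packed));

/*
 * Build the descriptor for a table of nentries entries.  The limit is
 * the offset of the last valid byte, hence size - 1 rather than size.
 */
struct region_descriptor
make_gdt_region(const struct gdt_entry *table, size_t nentries)
{
	size_t size = nentries * sizeof(*table);
	struct region_descriptor rd;

	/* The limit field is 16 bits wide, so the table is at most 64KB. */
	assert(size > 0 && size <= 65536);
	rd.rd_limit = (uint16_t)(size - 1);
	rd.rd_base = (uint64_t)(uintptr_t)table;
	return rd;
}
```

With three 8-byte entries the table is 24 bytes, so the limit comes out as 23.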


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: in locore.S,
%r9 is not saved over function calls, and moreover we did not even
want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.124 01-Jun-2021 guenther

Don't clear the cpu's bit in the old pmap's pm_cpus until we're off
the old one and set it in the new pmap's pm_cpus before loading
%cr3 with the new value. In particular, do neither if %cr3 isn't
changing.

This eliminates a window where, when switching between threads in
a single process, the pmap wouldn't have this cpu's bit set even
though we didn't change %cr3. With more of uvm unlocked, it was
possible for another cpu to update the page tables but not see a
need to send an IPI to this cpu, leading to crashes when TLB entries
that should have been invalidated were used.

malloc_duel testing by abluhm@
ok abluhm@ kettenis@ mlarkin@


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
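
The cookie scheme described above can be sketched in C roughly as follows.
This is an illustration only: the struct, the function names, and the 64-bit
cookie width are assumptions made for the example, and the check that the
syscall came from the exact sigtramp PC is left out.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down signal context; only the cookie matters here. */
struct sigcontext_sketch {
	uint64_t sc_rip;
	uint64_t sc_cookie;
};

/* Signal delivery: bind the cookie to this frame's own address. */
void
sketch_sendsig(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	sc->sc_cookie = (uint64_t)(uintptr_t)sc ^ proc_cookie;
}

/*
 * Signal return: accept the frame only if the cookie matches the
 * per-process secret XORed with the frame's address, then clear it so
 * the same sigcontext cannot be replayed.  Returns 0 on success, -1 on
 * rejection.
 */
int
sketch_sigreturn(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	if (sc->sc_cookie != ((uint64_t)(uintptr_t)sc ^ proc_cookie))
		return -1;
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address is mixed into the cookie, a frame copied to a different
location fails the check, and because the cookie is cleared on success, a
second return through the same frame fails as well.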


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
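
For context, the low two bits of an x86 segment selector hold the requested
privilege level, so masking with 3 isolates it. A minimal sketch in C, where
the constant values follow the architectural layout but the definitions and
the selector_is_user() helper are written for this example rather than taken
from the kernel headers:

```c
#include <stdint.h>

#define SEL_KPL	0	/* kernel privilege level */
#define SEL_UPL	3	/* user privilege level */
#define SEL_RPL	3	/* mask for the requested privilege level bits */

/* Nonzero if the saved code-segment selector belongs to userspace. */
int
selector_is_user(uint16_t cs)
{
	return (cs & SEL_RPL) == SEL_UPL;
}
```

Using one named mask for every such test keeps the privilege check readable
and avoids mixing SEL_UPL (a level) with SEL_RPL (a mask) even though both
happen to equal 3.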


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
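
The padding described above amounts to rounding the end of the kernel area up
to the next 2MB boundary. A minimal sketch of that arithmetic, assuming a
hypothetical helper name and a 2MB large-page size:

```c
#include <stdint.h>

/* Assumed large-page size: 2MB. */
#define LARGE_PAGE_SIZE	(UINT64_C(2) * 1024 * 1024)

/* Hypothetical helper: round addr up to the next 2MB boundary. */
static uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}
```

The gap between the real end of bss and the rounded-up address is the VA hole
the commit message refers to.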


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
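
The return-to-userspace logic described above can be sketched in C roughly as
follows. This is an illustration only: the real code is assembly, the struct
layouts are invented, and the two counters stand in for the actual wrmsr and
%fs reload. Only the field names pcb_fsbase and ci_cur_fsbase come from the
commit message.

```c
#include <stdint.h>

struct pcb_sketch {
	uint64_t pcb_fsbase;		/* FS.base the thread wants */
};

struct cpu_info_sketch {
	uint64_t ci_cur_fsbase;		/* FS.base currently in the CPU */
	unsigned wrmsr_count;		/* stand-in for wrmsr executions */
	unsigned fs_reload_count;	/* stand-in for reloads of %fs */
};

/* Hypothetical helper run on return to userspace. */
static void
restore_fsbase(struct cpu_info_sketch *ci, const struct pcb_sketch *pcb)
{
	/* Skip the update entirely if the CPU already has the right value. */
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;

	if (pcb->pcb_fsbase == 0)
		ci->fs_reload_count++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* general case: write the MSR */

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```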


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
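
For context, the limit field of an x86 descriptor-table register holds the
offset of the last valid byte, not the byte count, hence "size - 1". A minimal
sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the 16-bit limit for a descriptor table
 * occupying table_bytes bytes. The limit is the last valid byte offset.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	/* The limit field is 16 bits wide, so the table is at most 64KB. */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```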


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now in locore.S:
%r9 is not saved over function calls,
and moreover we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.123 25-May-2021 guenther

clang's assembler now supports 64-suffixed versions of the
fxsave/xsave/fxrstor/xrstor family of instructions. Use them
directly instead of inserting the 0x48 prefix manually.

ok kettenis@ deraadt@


Revision tags: OPENBSD_6_9_BASE
# 1.122 03-Nov-2020 guenther

Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
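
The flag-based tracking described in this entry can be modelled in C roughly
as below. This is a toy sketch, not the kernel's code: the structs, the
function names, and the use of a single int as the "xstate" are all
assumptions made for illustration.

```c
#include <stdbool.h>

#define BLANK_XSTATE	0	/* stand-in for the blank extended state */

struct thread_sketch {
	int saved_xstate;	/* copy of the thread's state in its PCB */
};

struct cpu_sketch {
	bool user_xstate_loaded;	/* is curproc's xstate in the CPU? */
	int fpu_regs;			/* stand-in for the FPU registers */
};

/* Context switch, sendsig(), vmm, kernel crypto: give up the user state. */
static void
fpu_kernel_enter(struct cpu_sketch *ci, struct thread_sketch *cur)
{
	if (ci->user_xstate_loaded) {
		cur->saved_xstate = ci->fpu_regs;	/* save to the PCB */
		ci->user_xstate_loaded = false;
		ci->fpu_regs = BLANK_XSTATE;		/* load blank state */
	}
}

/* Return to userspace: restore the thread's state if it is not loaded. */
static void
fpu_return_to_user(struct cpu_sketch *ci, const struct thread_sketch *cur)
{
	if (!ci->user_xstate_loaded) {
		ci->user_xstate_loaded = true;
		ci->fpu_regs = cur->saved_xstate;
	}
}
```

Because the state is always either loaded for curproc or blank, no trap or
IPI is needed to find out where a thread's FPU state lives.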


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
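
The cookie scheme described in this entry can be sketched in C. This is only
an illustration under stated assumptions, not the kernel code: the struct and
function names (sigcontext_sketch, store_cookie, check_and_clear_cookie) are
hypothetical, and the per-process secret is passed in as a plain parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real sigcontext; only the cookie matters. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;
};

/* Delivery side: store (per-process secret ^ address of this sigcontext). */
void
store_cookie(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* Return side: verify the cookie, then clear it so the frame cannot be reused. */
bool
check_and_clear_cookie(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address of the sigcontext is mixed into the cookie, a frame copied
to a different address fails the check, and a frame that was already consumed
fails it as well.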


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
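
A minimal sketch of the kind of test this entry refers to, assuming the usual
x86 selector layout where the low two bits carry the requested privilege
level. The macro values are defined locally for the example and are assumed to
match the kernel headers; selector_is_user() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_KPL	0	/* kernel privilege level */
#define SEL_UPL	3	/* user privilege level */
#define SEL_RPL	3	/* mask covering the two RPL bits of a selector */

/* True when the selector's privilege bits say "user mode". */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL) == SEL_UPL;
}
```

Using one mask name everywhere keeps the intent visible: the mask selects the
privilege bits, and the comparison value says which level is being tested.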


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
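
The padding arithmetic behind this entry can be sketched as follows. This is
an illustration only: round_up_2mb() and hole_after() are hypothetical
helpers, and the addresses used with them are made up.

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	((uint64_t)2 * 1024 * 1024)	/* 2MB large page */

/* Round an address up to the next 2MB boundary (no change if aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the hole between the end of bss and the padded end of the kernel. */
uint64_t
hole_after(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

With the kernel area ending on a 2MB boundary, whole large pages of the direct
map cover it exactly, so their write permission can be dropped without
affecting memory beyond the kernel.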


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
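
The return-to-userspace decision described in this entry can be sketched in C.
This is a simulation under stated assumptions, not the kernel code: the
counters stand in for the real wrmsr and %fs reload, and cpu_sketch and
restore_fsbase() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical per-CPU state: the cached FS.base plus counters for the demo. */
struct cpu_sketch {
	uint64_t cur_fsbase;	/* value the CPU currently holds */
	unsigned msr_writes;	/* stands in for wrmsr to the FS.base MSR */
	unsigned seg_loads;	/* stands in for reloading %fs */
};

void
restore_fsbase(struct cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->seg_loads++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->msr_writes++;	/* nonzero base needs the MSR write */
	ci->cur_fsbase = pcb_fsbase;
}
```

Returning to the same thread twice in a row costs nothing extra, and a
non-TLS process only ever takes the cheaper segment-load path.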


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
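
A short sketch of why the limit is size - 1: the limit field of a descriptor
table register holds the offset of the last valid byte, not the byte count.
gdt_limit() is a hypothetical helper that counts the table in 8-byte slots and
assumes at least one slot.

```c
#include <stdint.h>

#define DESC_SLOT_SIZE	8u	/* bytes per GDT slot */

/* Limit value to load for a table of nslots slots (nslots must be >= 1). */
uint16_t
gdt_limit(unsigned nslots)
{
	return (uint16_t)(nslots * DESC_SLOT_SIZE - 1);
}
```

Loading the plain size instead would make the byte just past the table look
like part of it.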


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.122 03-Nov-2020 guenther

Give sizes to more of the functions in locore.S

ok mpi@


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

Put setjmp+longjmp inside #ifdef DDB, the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used, otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
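
The flag tracking described in this entry can be sketched in C. This is a toy
model under stated assumptions, not the kernel code: the xstate is reduced to
a single int, and the struct and function names are hypothetical.

```c
#include <stdbool.h>

#define BLANK_XSTATE	0	/* stands in for the freshly initialized state */

/* Hypothetical per-CPU view: is curproc's xstate loaded, and what is loaded. */
struct cpu_fpu_sketch {
	bool user_xstate_loaded;
	int hw_xstate;
};

/* Hypothetical per-thread save area (the PCB in the description above). */
struct thread_fpu_sketch {
	int saved_xstate;
};

/* Context switch, sendsig(), vmm, kernel crypto: save and blank if loaded. */
void
fpu_kernel_claim(struct cpu_fpu_sketch *ci, struct thread_fpu_sketch *td)
{
	if (ci->user_xstate_loaded) {
		td->saved_xstate = ci->hw_xstate;
		ci->user_xstate_loaded = false;
		ci->hw_xstate = BLANK_XSTATE;
	}
}

/* Return to userspace: restore the thread's state if it is not loaded. */
void
fpu_return_to_user(struct cpu_fpu_sketch *ci, struct thread_fpu_sketch *td)
{
	if (!ci->user_xstate_loaded) {
		ci->user_xstate_loaded = true;
		ci->hw_xstate = td->saved_xstate;
	}
}
```

The single flag makes both paths idempotent: claiming the FPU twice does not
overwrite the saved state with the blank one, and returning to userspace twice
does not reload state that is already in place.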


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
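
The cookie scheme described in 1.78 can be sketched in plain C. This is only an illustration under stated assumptions: the struct, the function names, and the way the per-process secret is passed in are all hypothetical, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the real struct sigcontext. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* per-process secret XOR address of this struct */
	/* saved registers would follow here */
};

/* Signal delivery side: stamp the frame with (secret ^ &sigcontext). */
static void
sketch_sendsig(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* Return side: verify the cookie, then clear it to prevent reuse. */
static bool
sketch_sigreturn(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;	/* forged, moved, or already consumed */
	sc->sc_cookie = 0;
	return true;
}
```

Because the address of the sigcontext is mixed into the cookie, a frame copied to a different address fails the check, and clearing the cookie on success makes a second return through the same frame fail as well.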


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
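
The padding arithmetic behind 1.61 can be sketched as a round-up to the next 2MB boundary. This is a minimal illustration, assuming 2MB is the granularity at which the direct map permissions are changed; the macro and function names are hypothetical and not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed padding granularity: 2MB. */
#define SKETCH_LARGE_PAGE	(UINT64_C(2) * 1024 * 1024)

/* Round an address up to the next 2MB boundary (no-op if already aligned). */
static uint64_t
sketch_round_up_2mb(uint64_t addr)
{
	assert(addr <= UINT64_MAX - (SKETCH_LARGE_PAGE - 1));
	return (addr + SKETCH_LARGE_PAGE - 1) & ~(SKETCH_LARGE_PAGE - 1);
}

/* Size of the VA hole created between the end of bss and the padded end. */
static uint64_t
sketch_hole_size(uint64_t end_of_bss)
{
	return sketch_round_up_2mb(end_of_bss) - end_of_bss;
}
```

The hole computed here is the region the message refers to when it says symbols loaded at "end of bss" need to be accounted for.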


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
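
The skip-if-unchanged logic from 1.45 can be modelled in user-space C. In this sketch only pcb_fsbase and ci_cur_fsbase come from the commit message; the structs, the counters, and the function name are hypothetical, and the privileged wrmsr and %fs reload are replaced by counters so the logic can be followed without hardware access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU state; counters stand in for the real instructions. */
struct fsbase_cpu_sketch {
	uint64_t ci_cur_fsbase;	/* FS.base value currently in the CPU */
	unsigned wrmsr_count;	/* simulated wrmsr executions */
	unsigned fs_reloads;	/* simulated %fs selector reloads */
};

/* Hypothetical per-thread state. */
struct fsbase_pcb_sketch {
	uint64_t pcb_fsbase;	/* FS.base value the thread wants */
};

/* Called on return to userspace: touch FS.base only when it must change. */
static void
sketch_restore_fsbase(struct fsbase_cpu_sketch *ci,
    const struct fsbase_pcb_sketch *pcb)
{
	assert(ci != NULL && pcb != NULL);
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		ci->fs_reloads++;	/* loading %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* general case needs wrmsr */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

Repeated returns to userspace for the same thread cost nothing after the first, and non-threaded processes without TLS take the cheaper %fs path.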


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.121 02-Nov-2020 guenther

Restore abstraction of register saving into macros in frameasm.h
The Meltdown mitigation work ran right across the previous abstractions;
draw slightly different lines and use separate macros for interrupts
vs traps vs syscall.

The generated ASM for traps and general interrupts is completely
unchanged; the ASM for the four directly routed interrupts is brought
into line with the general interrupts; the ASM for syscalls is
changed to delay reenabling interrupts until after all registers
are saved and cleared.

ok mpi@


Revision tags: OPENBSD_6_8_BASE
# 1.120 17-May-2020 deraadt

Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the Meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
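
A minimal sketch of the flag tracking described in the two bullets of this
entry, assuming a single "xstate is loaded" flag per CPU. All names here
(sketch_cpu, sketch_thread, sketch_fpu_kernel_enter, sketch_fpu_user_return)
are hypothetical stand-ins for illustration, not the kernel's real structures
or functions, and the FPU registers are modeled as a plain integer:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the extended (FPU/SSE/AVX) register state. */
struct sketch_xstate {
	int regs;
};

/* Hypothetical thread: holds the copy of its xstate kept in the PCB. */
struct sketch_thread {
	struct sketch_xstate saved;
};

/* Hypothetical CPU: the flag plus a model of the live FPU registers. */
struct sketch_cpu {
	bool xstate_loaded;		/* is curproc's xstate in the CPU? */
	struct sketch_xstate live;
};

static const struct sketch_xstate sketch_blank = { 0 };

/*
 * Context switch, sendsig(), vmm, kernel crypto: if the flag is set,
 * save the state to the PCB, clear the flag, then load the blank state.
 */
static void
sketch_fpu_kernel_enter(struct sketch_cpu *ci, struct sketch_thread *td)
{
	if (ci->xstate_loaded) {
		td->saved = ci->live;
		ci->xstate_loaded = false;
		ci->live = sketch_blank;
	}
}

/*
 * Return to userspace: if the flag is clear, set it and restore the
 * thread's state.
 */
static void
sketch_fpu_user_return(struct sketch_cpu *ci, const struct sketch_thread *td)
{
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = true;
		ci->live = td->saved;
	}
}
```

The sketch leaves out the xrstor fault handling mentioned above; it only
shows why no #DNA trap or IPI is needed once the flag alone decides when
state is saved and restored.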


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
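
A minimal sketch of the cookie scheme this entry describes, assuming the
cookie is simply the per-process secret XORed with the address of the
sigcontext. The struct and function names (sketch_sigcontext,
sketch_set_cookie, sketch_check_cookie) are hypothetical and chosen for
illustration; the real sigcontext layout and the PC check against the
sigtramp are not modeled here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct sigcontext. */
struct sketch_sigcontext {
	uint64_t sc_cookie;
	/* saved registers would follow in a real sigcontext */
};

/* sendsig() side: bind the per-process secret to this frame's address. */
static void
sketch_set_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn(2) side: accept the frame only if the cookie matches this
 * exact address, then clear it so the same sigcontext cannot be reused.
 */
static bool
sketch_check_cookie(struct sketch_sigcontext *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address is mixed into the cookie, a sigcontext copied to a
different location fails the check, and clearing the cookie on success is
what blocks replay of the same frame.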


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
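
A minimal sketch of the "skip the setting if the CPU has the right value
already" idea from this entry. The field names pcb_fsbase and ci_cur_fsbase
come from the commit message; the surrounding structs, the msr_writes
counter, and sketch_restore_fsbase() are hypothetical, and the actual wrmsr
(or %fs reload for the zero case) is only represented by a comment:

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the PCB. */
struct sketch_pcb {
	uint64_t pcb_fsbase;		/* FS.base this thread wants */
};

/* Hypothetical, reduced stand-in for struct cpu_info. */
struct sketch_cpu_info {
	uint64_t ci_cur_fsbase;		/* FS.base currently in the CPU */
	unsigned int msr_writes;	/* counts the expensive updates */
};

/* On return to userspace: only touch FS.base when it actually differs. */
static void
sketch_restore_fsbase(struct sketch_cpu_info *ci, const struct sketch_pcb *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;
	/* real code would issue wrmsr here, or reload %fs when zero */
	ci->msr_writes++;
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

Repeated returns to userspace by the same thread then cost one comparison
instead of one MSR write each.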


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.120 17-May-2020 deraadt

Put setjmp+longjmp inside #ifdef DDB the only kernel-side user.
This shrinks the ramdisks a tiny bit.


Revision tags: OPENBSD_6_6_BASE OPENBSD_6_7_BASE
# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to the right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
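
The flag tracking described in this entry can be illustrated with a toy
model. This is only a sketch under stated assumptions: the names Cpu, Thread,
kernel_takes_fpu and return_to_userspace are hypothetical, the register
contents are modelled as plain strings, and none of it is the kernel's code.

```python
BLANK = "blank"  # stand-in for the blank (initial) xstate


class Cpu:
    """Hypothetical model of one CPU's FPU registers plus the tracking flag."""

    def __init__(self):
        self.regs = BLANK            # what the FPU registers currently hold
        self.xstate_loaded = False   # is curproc's xstate in the registers?


class Thread:
    """Hypothetical thread with a saved copy of its xstate in the PCB."""

    def __init__(self, xstate):
        self.pcb_xstate = xstate


def kernel_takes_fpu(cpu, curthread):
    """Context switch, sendsig(), vmm, kernel crypto: check the flag first."""
    if cpu.xstate_loaded:
        curthread.pcb_xstate = cpu.regs  # save the thread's state to the PCB
        cpu.xstate_loaded = False        # clear the flag
        cpu.regs = BLANK                 # load the blank state


def return_to_userspace(cpu, curthread):
    """If the flag is clear, set it and restore the thread's state."""
    if not cpu.xstate_loaded:
        cpu.xstate_loaded = True
        cpu.regs = curthread.pcb_xstate
```

In this model the save happens at most once per kernel excursion, and the
restore happens only on the way back to userspace, which is the "semi-eager"
behaviour the entry describes.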


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
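
The cookie scheme in this entry can be sketched as a toy model. The names
Process, sendsig and sigreturn below are hypothetical Python stand-ins, the
sigcontext is modelled as a dict, and the check that the syscall came from the
exact sigtramp PC is left out; this is an illustration of the XOR cookie idea
only, not the kernel's implementation.

```python
import secrets


class Process:
    """Toy process carrying a per-process secret (hypothetical name)."""

    def __init__(self, cookie=None):
        # A random 64-bit secret unless the caller supplies one.
        self.cookie = secrets.randbits(64) if cookie is None else cookie


def sendsig(proc, sigcontext_addr):
    """Build a sigcontext whose cookie binds it to its own address."""
    return {"addr": sigcontext_addr, "cookie": proc.cookie ^ sigcontext_addr}


def sigreturn(proc, sigcontext):
    """Accept the sigcontext once; reject forged, moved, or reused ones."""
    expected = proc.cookie ^ sigcontext["addr"]
    if sigcontext["cookie"] != expected:
        return False
    sigcontext["cookie"] = 0  # clear so the same context cannot be replayed
    return True
```

Because the cookie mixes in the sigcontext's own address, a context copied to
a different address fails the check, and clearing the cookie after a
successful return makes a second sigreturn on the same context fail too.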


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
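
The padding arithmetic behind this entry is ordinary round-up-to-alignment.
A small worked sketch follows; the function names and the example addresses
are hypothetical and chosen only to show the calculation.

```python
TWO_MB = 2 * 1024 * 1024  # large-page size used for the direct map


def round_up_2mb(addr):
    """Round addr up to the next 2MB boundary (no change if already aligned)."""
    return (addr + TWO_MB - 1) & ~(TWO_MB - 1)


def hole_size(end_of_bss):
    """Bytes of VA hole between the end of bss and the padded kernel end."""
    return round_up_2mb(end_of_bss) - end_of_bss
```

For example, a bss ending at 0x3FF000 is padded to 0x400000, leaving a hole of
0x1000 bytes; symbols loaded at "end of bss" land in that hole.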


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
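
The return-to-userspace decision in this entry can be sketched as a small
function. pcb_fsbase and ci_cur_fsbase are the fields named above; the
function name and the returned action strings are hypothetical labels used
only for illustration.

```python
def fsbase_action(pcb_fsbase, ci_cur_fsbase):
    """Pick the cheapest way to make the CPU's FS.base match the thread's."""
    if pcb_fsbase == ci_cur_fsbase:
        return "skip"          # CPU already has the right value
    if pcb_fsbase == 0:
        return "reload %fs"    # setting %fs zeros FS.base, cheaper than wrmsr
    return "wrmsr"             # non-zero base needs the MSR write
```

After either of the two non-skip actions the caller would record the new
value in ci_cur_fsbase so the next return can take the "skip" path.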


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
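
The off-by-one fixed here follows from how x86 descriptor-table limits work:
the limit field holds the offset of the last valid byte, not the byte count.
A minimal sketch, with a hypothetical function name and an example table size
picked for illustration:

```python
def table_limit(size_in_bytes):
    """Limit value for a descriptor table occupying size_in_bytes bytes."""
    if size_in_bytes <= 0:
        raise ValueError("table must contain at least one byte")
    return size_in_bytes - 1  # offset of the last valid byte
```

A table of eight 8-byte descriptors occupies 64 bytes, so its limit is 63
(0x3f); using 64 would claim one byte past the end of the table.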


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: in locore.S,
%r9 is not saved over function calls,
and moreover we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.119 07-Aug-2019 guenther

Mitigate CVE-2019-1125: block speculation past conditional jump to mis-skip
or mis-take swapgs in interrupt path and in trap/fault/exception path. The
latter is improved to have no conditionals around this when Meltdown mitigation
is in effect. Codepatch out the fences based on the description of CPU bugs
in the (well written) Linux commit message.

feedback from kettenis@
ok deraadt@


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

branches: 1.116.2;
Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

branches: 1.111.2;
In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when Meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
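
The PCID split described in this entry can be sketched in C roughly as follows. The enum names, the make_cr3() helper, and the macro names are illustrative assumptions, not taken from the commit. The only hardware facts relied on are that, with PCIDs enabled, the PCID occupies bits 11:0 of %cr3 and that setting bit 63 of the value written to %cr3 asks the CPU not to flush that PCID's TLB entries.

```c
#include <stdint.h>

/* Hypothetical names for the four PCIDs the commit message lists. */
enum pcid {
	PCID_KERN_THREADS = 0,	/* kernel threads */
	PCID_PROC_UK      = 1,	/* normal process, U+K page tables */
	PCID_PROC_U_ONLY  = 2,	/* same process, U-K tables (Meltdown) */
	PCID_TEMP_MAP     = 3	/* temporary mappings of another process */
};

#define CR3_PCID_MASK	0xfffULL	/* PCID occupies %cr3 bits 11:0 */
#define CR3_NOFLUSH	(1ULL << 63)	/* keep TLB entries for this PCID */

/* Build a %cr3 value from a page-aligned table address and a PCID. */
uint64_t
make_cr3(uint64_t table_pa, enum pcid id, int keep_tlb)
{
	uint64_t cr3 = (table_pa & ~CR3_PCID_MASK) | (uint64_t)id;

	if (keep_tlb)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}
```

Giving each class of mapping its own tag is what lets a switch between them skip a full TLB flush.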


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
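
The flag-based tracking described in this entry can be modelled in plain C as below. This is a simplified sketch, not the kernel code: the struct layouts, field names, and function names are hypothetical, and a byte array stands in for the real xstate. It covers only the two rules in the bullet list above, not the xrstor fault handling.

```c
#include <stdbool.h>
#include <string.h>

struct xstate { unsigned char regs[64]; };	/* stand-in for saved FPU state */

struct thread {
	struct xstate pcb_fpu;	/* hypothetical PCB save area */
};

struct cpu {
	struct xstate live;	/* what the CPU's FPU registers hold */
	bool user_state_loaded;	/* the flag from the commit message */
};

/* Kernel needs the FPU (context switch, sendsig, vmm, CPU crypto). */
void
fpu_kernel_enter(struct cpu *ci, struct thread *cur)
{
	if (ci->user_state_loaded) {
		cur->pcb_fpu = ci->live;		/* save to the PCB */
		ci->user_state_loaded = false;		/* clear the flag */
		memset(&ci->live, 0, sizeof(ci->live));	/* load blank state */
	}
}

/* Return to userspace: restore only if the state is not already loaded. */
void
fpu_return_to_user(struct cpu *ci, const struct thread *cur)
{
	if (!ci->user_state_loaded) {
		ci->live = cur->pcb_fpu;
		ci->user_state_loaded = true;
	}
}
```

Because both paths test a single flag, calling either one twice in a row is harmless, which is part of why this scheme is simpler than lazy switching.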


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
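
A minimal C sketch of the cookie part of this mitigation follows. The struct, the field name sc_cookie, and both function names are assumptions made for illustration; the check that the syscall came from the exact sigtramp PC is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* hypothetical field holding the cookie */
};

/* sendsig(): store (per-process secret ^ address of this sigcontext). */
void
cookie_store(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/* sigreturn(): verify the cookie, then clear it so the frame can't be reused. */
bool
cookie_check_and_clear(struct sigcontext_sketch *sc, uintptr_t proc_secret)
{
	bool ok = (sc->sc_cookie == (proc_secret ^ (uintptr_t)sc));

	sc->sc_cookie = 0;
	return ok;
}
```

Mixing the sigcontext address into the cookie means a frame copied to a different address fails the check, and clearing it means the same frame cannot be replayed.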


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
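
For readers unfamiliar with the test being made consistent here: on x86 the low two bits of a segment selector hold the requested privilege level. A hedged C sketch, with macro and function names chosen for illustration rather than copied from the kernel headers:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_RPL_MASK	0x3	/* low two selector bits: requested privilege level */
#define PL_KERNEL	0	/* ring 0 */
#define PL_USER		3	/* ring 3 */

/* True if the saved selector says the interrupted code ran in userspace. */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL_MASK) == PL_USER;
}
```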


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
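
The padding arithmetic this entry relies on is ordinary round-up to a 2MB boundary. A small C sketch, with hypothetical helper names, shows both the rounding and the size of the resulting VA hole after bss:

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB large page */

/* Round an address up to the next 2MB boundary (no change if aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the VA hole between the end of bss and the padded end. */
uint64_t
pad_hole_size(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```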


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
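
The skip-if-unchanged logic in this entry can be sketched in C as follows. The field names pcb_fsbase and ci_cur_fsbase come from the commit message; the function name and the two counters, which stand in for the real wrmsr and %fs reload, are assumptions for illustration.

```c
#include <stdint.h>

struct pcb { uint64_t pcb_fsbase; };		/* per-thread FS.base */
struct cpu_info { uint64_t ci_cur_fsbase; };	/* what the CPU holds now */

unsigned wrmsr_count;		/* stand-in: number of wrmsr executions */
unsigned fs_reload_count;	/* stand-in: number of %fs selector reloads */

/* Return-to-userspace step: make FS.base match the thread, cheaply. */
void
restore_fsbase(struct cpu_info *ci, const struct pcb *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb->pcb_fsbase == 0)
		fs_reload_count++;	/* reloading %fs zeros FS.base */
	else
		wrmsr_count++;		/* wrmsr to the FS.base MSR */
	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```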


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@
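
A 64-bit pm_cpus bitmask needs 64-bit shifts and bit operations throughout, which is the kind of detail the assembly update has to get right. The C sketch below uses a hypothetical struct and helper names to show the equivalent operations:

```c
#include <stdbool.h>
#include <stdint.h>

struct pmap_sketch { uint64_t pm_cpus; };	/* one bit per CPU */

/* Mark the pmap as active on the given CPU (0..63). */
void
pmap_mark_active(struct pmap_sketch *pm, unsigned cpu)
{
	pm->pm_cpus |= (uint64_t)1 << cpu;	/* 64-bit shift: cpu may be >= 32 */
}

/* Clear the CPU's bit when the pmap is deactivated there. */
void
pmap_mark_inactive(struct pmap_sketch *pm, unsigned cpu)
{
	pm->pm_cpus &= ~((uint64_t)1 << cpu);
}

bool
pmap_is_active(const struct pmap_sketch *pm, unsigned cpu)
{
	return (pm->pm_cpus >> cpu) & 1;
}
```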


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
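
The rule behind this fix is that the limit loaded into the GDT register is the offset of the last valid byte, not the table size. A short C sketch, assuming 8-byte descriptors and a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

#define GDT_ENTRY_SIZE	8	/* bytes per ordinary segment descriptor */

/*
 * The limit field is the offset of the last valid byte, i.e. size - 1.
 * Assumes 1 <= nentries <= 8192 so the result fits in 16 bits.
 */
uint16_t
gdt_limit(size_t nentries)
{
	return (uint16_t)(nentries * GDT_ENTRY_SIZE - 1);
}
```

Using the size itself as the limit would make one byte past the end of the table appear valid to the CPU.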


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.118 17-May-2019 guenther

Mitigate Intel's Microarchitectural Data Sampling vulnerability.
If the CPU has the new VERW behavior then that is used; otherwise
the proper sequence from Intel's "Deep Dive" doc is used in the
return-to-userspace and enter-VMM-guest paths. The enter-C3-idle
path is not mitigated because it's only a problem when SMT/HT is
enabled: mitigating everything when that's enabled would be a _huge_
set of changes that we see no point in doing.

Update vmm(4) to pass through the MSR bits so that guests can apply
the optimal mitigation.

VMM help and specific feedback from mlarkin@
vendor-portability help from jsg@ and kettenis@
ok kettenis@ mlarkin@ deraadt@ jsg@


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
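
The flag tracking described in the two bullets above can be sketched in C. This is only an illustration under stated assumptions: every name here (sketch_cpu, sketch_thread, user_loaded and both functions) is hypothetical, a small byte array stands in for the real xstate area, and memcpy/memset stand in for xsave/xrstor. It is not the kernel's code.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_XSTATE_BYTES 16	/* hypothetical; the real area is far larger */

/* Hypothetical stand-in for the per-thread save area in the PCB. */
struct sketch_thread {
	unsigned char saved[SKETCH_XSTATE_BYTES];
};

/* Hypothetical stand-in for per-CPU state. */
struct sketch_cpu {
	unsigned char regs[SKETCH_XSTATE_BYTES];	/* models the FPU registers */
	int user_loaded;	/* flag: curproc's xstate is loaded in the CPU */
};

/*
 * Context switch, sendsig(), vmm, in-kernel crypto: if the flag is set,
 * save the thread's state, clear the flag, and load the blank state.
 */
static void
sketch_fpu_kernel_use(struct sketch_cpu *ci, struct sketch_thread *td)
{
	assert(ci != NULL && td != NULL);
	if (ci->user_loaded) {
		memcpy(td->saved, ci->regs, SKETCH_XSTATE_BYTES);
		ci->user_loaded = 0;
		memset(ci->regs, 0, SKETCH_XSTATE_BYTES);
	}
}

/* Return to userspace: if the flag is clear, set it and restore the state. */
static void
sketch_return_to_user(struct sketch_cpu *ci, const struct sketch_thread *td)
{
	assert(ci != NULL && td != NULL);
	if (!ci->user_loaded) {
		memcpy(ci->regs, td->saved, SKETCH_XSTATE_BYTES);
		ci->user_loaded = 1;
	}
}
```

Because both paths test the flag first, repeated kernel uses between two returns to userspace save the state only once. The sketch leaves out the xrstor fault handling that the commit message describes.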


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
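
For illustration, a privilege-level test with a single mask might look like the sketch below. The SKETCH_* constants and the function name are hypothetical and defined locally; they rely only on the x86 rule that the low two bits of a segment selector hold the requested privilege level.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SEL_RPL	0x3	/* mask: low two bits of a selector */
#define SKETCH_SEL_KPL	0	/* kernel privilege level */
#define SKETCH_SEL_UPL	3	/* user privilege level */

/* Returns 1 if the saved %cs selector says the frame came from userspace. */
static int
sketch_frame_from_user(uint16_t cs)
{
	/* every privilege level must fit inside the mask */
	assert((SKETCH_SEL_UPL & ~SKETCH_SEL_RPL) == 0);
	return (cs & SKETCH_SEL_RPL) == SKETCH_SEL_UPL;
}
```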


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
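
The "size - 1" rule comes from the x86 descriptor-table format, where the limit field holds the offset of the last valid byte rather than the byte count. A small sketch, with a hypothetical struct and helper name rather than the kernel's own definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 10-byte operand lgdt takes in 64-bit mode. */
struct sketch_region_descriptor {
	uint16_t rd_limit;	/* offset of the last valid byte in the table */
	uint64_t rd_base;	/* linear address of the table */
} __attribute__((packed));

/* Convert a table size in bytes into the value for the limit field. */
static uint16_t
sketch_gdt_limit(size_t table_bytes)
{
	/* the limit is 16 bits wide, so the table can be at most 64KB */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

With three 8-byte descriptors, for example, the table is 24 bytes long and the limit is 23.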


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.117 12-May-2019 guenther

Delete cpu_idle_{enter,leave}_fcn() as unused. Add RETGUARD checks to
cpu_idle_cycle()

ok mpi@ kettenis@


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
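
The flag tracking described in the 1.98 entry can be sketched as a two-function state machine. This is a minimal model under stated assumptions, not the kernel's implementation: all struct and function names are hypothetical, the register file is a plain byte array, and the blank state is modelled as all zeroes.

```c
#include <string.h>

/* Hypothetical stand-ins for the CPU's extended state and per-thread PCB copy. */
struct xstate_sketch { unsigned char regs[64]; };
struct fpu_thread_sketch { struct xstate_sketch saved; };
struct fpu_cpu_sketch {
	struct xstate_sketch live;	/* what is currently in the "registers" */
	int xstate_loaded;		/* is curproc's xstate loaded in the CPU? */
};

/* Context switch, sendsig(), vmm, kernel crypto: take the FPU away from curproc. */
static void
fpu_kernel_claim_sketch(struct fpu_cpu_sketch *ci, struct fpu_thread_sketch *cur)
{
	if (ci->xstate_loaded) {
		cur->saved = ci->live;			/* save to the PCB */
		ci->xstate_loaded = 0;			/* clear the flag */
		memset(&ci->live, 0, sizeof(ci->live));	/* load the blank state */
	}
}

/* Return to userspace: restore the thread's state if it is not already loaded. */
static void
fpu_return_to_user_sketch(struct fpu_cpu_sketch *ci,
    const struct fpu_thread_sketch *cur)
{
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = 1;
		ci->live = cur->saved;
	}
}
```

The sketch leaves out the fault handling for a bad restore (the xrstor case the entry mentions) and the codepatched choice between fxsave/xsave variants.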


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
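
The cookie scheme in the 1.78 entry can be illustrated with a small sketch. It is not the kernel's code: the struct, the field name, and the function names are hypothetical, and the separate check that the syscall came from the exact sigtramp PC is left out. The point it shows is that the cookie binds a sigcontext to both the process secret and its own address, and is cleared on use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sigcontext with only the cookie field shown. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* sendsig() side: store (per-process secret ^ address of this sigcontext). */
static void
sendsig_store_cookie_sketch(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn(2) side: verify the cookie, then clear it to prevent reuse. */
static bool
sigreturn_check_cookie_sketch(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

A sigcontext copied to a different address fails the check because the address is mixed into the cookie, and a second sigreturn on the same context fails because the cookie has been cleared.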


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
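
The padding in the 1.61 entry is a round-up to the next 2MB boundary, presumably so that the kernel area ends on a large-page boundary of the direct map. A minimal sketch of that arithmetic, with a hypothetical macro and function name:

```c
#include <stdint.h>

/* 2MB, the amd64 large-page size; the name is hypothetical. */
#define L2_PAGE_SIZE_SKETCH	(2ULL * 1024 * 1024)

/* Round addr up to the next 2MB boundary (addresses already aligned are kept). */
static inline uint64_t
round_up_2mb_sketch(uint64_t addr)
{
	return (addr + L2_PAGE_SIZE_SKETCH - 1) & ~(L2_PAGE_SIZE_SKETCH - 1);
}
```

The gap between the real end of bss and the rounded-up address is the VA hole the entry refers to.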


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
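
The caching described in the 1.45 entry can be sketched as follows. Only pcb_fsbase and ci_cur_fsbase come from the entry; the struct names, the write counter, and the fake wrmsr helper are assumptions made so the example runs in userland. The zero-FS.base shortcut (reloading %fs instead of using wrmsr) is not modelled.

```c
#include <stdint.h>

/* Hypothetical per-CPU and per-thread structures with only the relevant fields. */
struct fsbase_cpu_sketch {
	uint64_t ci_cur_fsbase;	/* value the CPU currently has in FS.base */
	unsigned msr_writes;	/* counts simulated wrmsr calls, for illustration */
};

struct fsbase_pcb_sketch {
	uint64_t pcb_fsbase;	/* value this thread wants in FS.base */
};

/* Stand-in for the privileged wrmsr; a real kernel would write the MSR here. */
static void
fake_wrmsr_fsbase_sketch(struct fsbase_cpu_sketch *ci, uint64_t value)
{
	ci->ci_cur_fsbase = value;
	ci->msr_writes++;
}

/* On return to userspace, skip the MSR write if the CPU already has the value. */
static void
return_to_user_fsbase_sketch(struct fsbase_cpu_sketch *ci,
    const struct fsbase_pcb_sketch *pcb)
{
	if (ci->ci_cur_fsbase != pcb->pcb_fsbase)
		fake_wrmsr_fsbase_sketch(ci, pcb->pcb_fsbase);
}
```

Returning to the same thread twice in a row costs one write, not two.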


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
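
For illustration only: the limit field loaded by lgdt holds the offset of
the last valid byte of the table, which is why it must be the table size
minus one. A minimal C sketch follows; the struct and function names are
hypothetical and not taken from the OpenBSD sources, and the sketch does not
model the packed 10-byte layout the instruction actually reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lgdt operand (real layout is packed). */
struct gdt_pointer_sketch {
	uint16_t limit;	/* offset of the last valid byte: size - 1 */
	uint64_t base;	/* linear address of the table */
};

/* Build the descriptor-table pointer for a table of table_size bytes. */
struct gdt_pointer_sketch
make_gdt_pointer(uint64_t base, size_t table_size)
{
	struct gdt_pointer_sketch p;

	p.limit = (uint16_t)(table_size - 1);	/* not table_size */
	p.base = base;
	return p;
}
```

With a limit equal to the full size, the CPU would accept a selector that
points one byte past the end of the table.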


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock() and acpi_release_global_lock() to
amd64, the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc., just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now in locore.S:
%r9 is not saved over function calls, and moreover we did not even want
&proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


Revision tags: OPENBSD_6_5_BASE
# 1.116 02-Apr-2019 mortimer

Add variable length trap padding between the retguard epilogue and the
following return.

This change adds a constraint that the name passed to the RETGUARD_* macros
must correspond to the name in the corresponding ENTRY which starts the
function (or a function which appears beforehand in the same file). Since
we use the distance from the ENTRY definition to calculate how much padding
to insert, the ENTRY symbol must be in scope at assembly time. This is
almost always the case already, since it is the natural way to name the
retguard symbols so they remain unique.

ok deraadt@


# 1.115 01-Apr-2019 mortimer

Add retguard macros to kernel setjmp / longjmp.

ok deraadt@ kettenis@


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to use a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
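
The flag tracking described above can be sketched in plain C. This is a
userland model under stated assumptions, not the kernel code: the struct
names, the integer standing in for the FPU register file, and both function
names are hypothetical.

```c
#include <stdbool.h>

#define BLANK_XSTATE	0	/* stands in for the blank (initial) state */

/* Hypothetical stand-in for the xstate save area in a thread's PCB. */
struct thread_sketch {
	int saved_xstate;
};

/* Hypothetical per-CPU view: the flag plus a fake register file. */
struct cpu_fpu_sketch {
	bool xstate_loaded;	/* is curproc's xstate in the CPU? */
	int regs;		/* stands in for the FPU registers */
};

/* Context switch, sendsig(), vmm, in-kernel crypto: save and blank. */
void
sketch_fpu_kernel_enter(struct cpu_fpu_sketch *cpu, struct thread_sketch *td)
{
	if (cpu->xstate_loaded) {
		td->saved_xstate = cpu->regs;	/* save to the PCB */
		cpu->xstate_loaded = false;	/* clear the flag */
		cpu->regs = BLANK_XSTATE;	/* load the blank state */
	}
}

/* Return to userspace: restore only if the state is not already loaded. */
void
sketch_fpu_return_to_user(struct cpu_fpu_sketch *cpu,
    const struct thread_sketch *td)
{
	if (!cpu->xstate_loaded) {
		cpu->regs = td->saved_xstate;
		cpu->xstate_loaded = true;
	}
}
```

Because the state is always either in the CPU with the flag set or in the
PCB with the flag clear, no trap is needed to discover which copy is current.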


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
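
A minimal sketch of the cookie scheme described above, assuming a 64-bit
per-process secret. The struct, field, and function names are hypothetical,
and the check that sigreturn(2) was entered from the exact sigtramp PC is
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sigcontext holding only the cookie field. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* Signal delivery: tie the context to its own address and the secret. */
void
sketch_sendsig(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn: accept the context only if the cookie matches this address,
 * then clear it so the same context cannot be replayed.
 */
bool
sketch_sigreturn(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

A context copied to another address, or presented a second time, fails the
comparison, which is the property the mitigation relies on.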


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK kettenis@ mlarkin@ jsg@


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
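
As a worked example of the padding arithmetic, rounding an address up to the
next 2MB boundary can be written as below. The macro and function names are
hypothetical; the kernel's own macros are not reproduced here.

```c
#include <stdint.h>

/* Size of a 2MB large page; the name is hypothetical. */
#define LARGE_PAGE_SIZE_SKETCH	(2ULL * 1024 * 1024)

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE_SKETCH - 1) &
	    ~(LARGE_PAGE_SIZE_SKETCH - 1);
}
```

The gap between the real end of bss and the rounded-up address is the VA
hole the commit message refers to.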


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
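
The skip-if-unchanged logic above can be modelled in userland C. This is a
sketch under stated assumptions: the struct and function names are
hypothetical, and the counters only stand in for the real wrmsr and %fs
reload so the effect of the cache is observable.

```c
#include <stdint.h>

/* Hypothetical per-CPU state; cur_fsbase plays the role of ci_cur_fsbase. */
struct cpu_sketch {
	uint64_t cur_fsbase;	/* FS.base value the CPU currently holds */
	unsigned wrmsr_count;	/* simulated writes to the FS.base MSR */
	unsigned segload_count;	/* simulated reloads of %fs */
};

/* On return to userspace: make FS.base match the thread's pcb_fsbase. */
void
sketch_restore_fsbase(struct cpu_sketch *ci, uint64_t pcb_fsbase)
{
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* CPU already has the right value */
	if (pcb_fsbase == 0)
		ci->segload_count++;	/* setting %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* otherwise pay for wrmsr */
	ci->cur_fsbase = pcb_fsbase;
}
```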


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
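
The "size - 1" rule follows from how x86 defines a descriptor-table limit: it is the offset of the last valid byte, not a byte count. A minimal sketch of the arithmetic, as a Python model rather than kernel code; the names gdt_limit and pack_gdtr are hypothetical, and the 8-byte slot size is an assumption (long-mode system descriptors such as the TSS take two slots):

```python
import struct

DESCRIPTOR_SIZE = 8  # assumed bytes per GDT slot


def gdt_limit(num_slots, slot_size=DESCRIPTOR_SIZE):
    """Limit field for a table of num_slots slots: offset of its last byte."""
    table_size = num_slots * slot_size
    return table_size - 1


def pack_gdtr(base, num_slots):
    """Build the 10-byte operand lgdt reads in 64-bit mode.

    Layout: 16-bit limit followed by 64-bit base, little-endian, no padding.
    """
    return struct.pack("<HQ", gdt_limit(num_slots), base)
```

Using the table size itself as the limit would make the byte just past the table addressable as part of a descriptor, which is the off-by-one this revision corrects.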


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.114 18-Feb-2019 yasuoka

Remove PTPpaddr and use proc0.p_addr->u_pcb.pcb_cr3 instead. This also
fixes kernel core dump to be readable by savecore. From fukaumi at
soum.co.jp

ok mlarkin


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
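
The flag tracking described above can be sketched as a toy state machine. This is a Python model under stated assumptions, not the kernel's code: the names Cpu, Thread, xstate_loaded, saved_xstate, fpu_kernel_enter and return_to_user are hypothetical, and strings stand in for register contents and the PCB save area.

```python
class Cpu:
    def __init__(self):
        self.xstate_loaded = False  # is curproc's xstate live in the registers?
        self.regs = "blank"         # stand-in for the FPU/SSE register file


class Thread:
    def __init__(self, name):
        self.saved_xstate = name + "-xstate"  # stand-in for the PCB save area


def fpu_kernel_enter(cpu, thread):
    """Model of context switch, sendsig(), vmm, or in-kernel CPU crypto.

    If the thread's state is live, save it, clear the flag, and load the
    blank state. If the flag is already clear, nothing is saved, so the
    saved copy is never overwritten with the blank state.
    """
    if cpu.xstate_loaded:
        thread.saved_xstate = cpu.regs
        cpu.xstate_loaded = False
        cpu.regs = "blank"


def return_to_user(cpu, thread):
    """Model of the return-to-userspace path: restore only if not loaded."""
    if not cpu.xstate_loaded:
        cpu.xstate_loaded = True
        cpu.regs = thread.saved_xstate
```

In this model there is no deferred save or restore, which is the property that lets the real change drop the %cr0 TS flag, the #DNA trap, and the FPU-sync IPIs. The xrstor fault handling mentioned in the commit is not modelled here.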


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
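
The cookie scheme described above can be illustrated with a small model. This is a hedged sketch in Python, not the kernel's implementation: Process, secret, sendsig and sigreturn here are hypothetical stand-ins, a dict plays the role of the sigcontext, and the check that the syscall came from the exact sigtramp PC is not modelled.

```python
import secrets

MASK64 = (1 << 64) - 1


class Process:
    def __init__(self, secret=None):
        # Hypothetical per-process random value chosen once.
        self.secret = secrets.randbits(64) if secret is None else secret


def sendsig(proc, sigcontext, sigcontext_addr):
    """Store (per-process secret ^ address of the sigcontext) in the frame."""
    sigcontext["cookie"] = (proc.secret ^ sigcontext_addr) & MASK64


def sigreturn(proc, sigcontext, sigcontext_addr):
    """Verify the cookie, then clear it so the same frame cannot be reused."""
    expected = (proc.secret ^ sigcontext_addr) & MASK64
    ok = sigcontext.get("cookie") == expected
    sigcontext["cookie"] = 0
    return ok
```

Because the address is mixed into the cookie, a frame copied to a different location fails the check, and because the cookie is cleared on use, replaying the same frame fails as well (short of the secret happening to equal the address).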


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
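
The padding to 2MB amounts to rounding the end of the kernel up to the next large-page boundary; the gap between the end of bss and that boundary is the VA hole mentioned above. A minimal sketch of the arithmetic in Python, with hypothetical names (round_up, hole_after) and made-up example addresses:

```python
LARGE_PAGE = 2 * 1024 * 1024  # 2MB large page


def round_up(addr, align=LARGE_PAGE):
    """Round addr up to the next multiple of align (a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def hole_after(bss_end, align=LARGE_PAGE):
    """Size in bytes of the gap between bss_end and the next boundary."""
    return round_up(bss_end, align) - bss_end
```

With the kernel area ending on a 2MB boundary, whole large pages of the direct map cover it exactly, so their write permission can be dropped without affecting neighbouring memory.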


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
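
The skip-if-unchanged logic and the zero fast path can be sketched as a toy model. This is Python standing in for the assembly, under stated assumptions: Cpu, cur_fsbase and return_to_user are hypothetical names mirroring ci_cur_fsbase and pcb_fsbase from the commit text, and counters replace the real instructions.

```python
class Cpu:
    def __init__(self):
        self.cur_fsbase = 0     # models ci_cur_fsbase
        self.wrmsr_count = 0    # how many times the slow path ran
        self.segload_count = 0  # how many times %fs was reloaded instead


def return_to_user(cpu, pcb_fsbase):
    """Make FS.base match the thread's value, doing as little as possible."""
    if cpu.cur_fsbase == pcb_fsbase:
        return "skip"                 # CPU already has the right value
    if pcb_fsbase == 0:
        cpu.segload_count += 1        # loading %fs zeros FS.base cheaply
        action = "load %fs"
    else:
        cpu.wrmsr_count += 1          # general case: write the MSR
        action = "wrmsr"
    cpu.cur_fsbase = pcb_fsbase
    return action
```

In this model, a CPU that keeps returning to the same thread, or alternates only between threads without TLS, never takes the wrmsr path.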


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before.

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
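
For context, the limit field of a descriptor-table register holds the offset of the last valid byte rather than the table's length, hence size - 1. A minimal sketch in C, where `struct gdt_region` and `make_gdt_region()` are hypothetical names used only for illustration (the real lgdt operand is a packed 10-byte structure, which this stand-in does not reproduce):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, unpacked stand-in for the lgdt operand. */
struct gdt_region {
	uint16_t rd_limit;	/* offset of the last valid byte in the table */
	uint64_t rd_base;	/* linear address of the table */
};

/* Describe a table of `size` bytes located at `base`. */
static struct gdt_region
make_gdt_region(uint64_t base, size_t size)
{
	struct gdt_region rd;

	assert(size > 0 && size <= 65536);	/* limit must fit in 16 bits */
	rd.rd_limit = (uint16_t)(size - 1);	/* inclusive limit: size - 1 */
	rd.rd_base = base;
	return rd;
}
```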


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.113 24-Jan-2019 deraadt

gdt64 is only used by locore0 during the gut-wrenching 32-bit bring-up, so
move it to right place.


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
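
The flag tracking described in this entry can be modelled roughly as follows. This is a sketch under loud assumptions: a single integer stands in for the whole xstate area, and `struct fpu_cpu`, `struct fpu_thread`, `fpu_save_and_blank()`, and `fpu_return_to_user()` are hypothetical names, not the kernel's actual routines.

```c
#include <assert.h>
#include <stddef.h>

#define BLANK_XSTATE 0	/* stands in for the freshly initialized state */

/* Hypothetical model of a thread: the copy of xstate kept in its PCB. */
struct fpu_thread {
	int saved_xstate;
};

/* Hypothetical model of a CPU. */
struct fpu_cpu {
	int xstate_loaded;	/* flag: is curproc's xstate live in the CPU? */
	int fpu_regs;		/* stands in for the FPU/SSE register file */
};

/* Context switch, sendsig(), vmm, kernel crypto: save and load blank. */
static void
fpu_save_and_blank(struct fpu_cpu *ci, struct fpu_thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (ci->xstate_loaded) {
		cur->saved_xstate = ci->fpu_regs;	/* save to the PCB */
		ci->xstate_loaded = 0;			/* clear the flag */
		ci->fpu_regs = BLANK_XSTATE;		/* load blank state */
	}
}

/* Return to userspace: restore only if the flag is clear. */
static void
fpu_return_to_user(struct fpu_cpu *ci, const struct fpu_thread *cur)
{
	assert(ci != NULL && cur != NULL);
	if (!ci->xstate_loaded) {
		ci->xstate_loaded = 1;
		ci->fpu_regs = cur->saved_xstate;	/* restore thread state */
	}
}
```

In this model no thread's register contents stay in the CPU once the kernel has switched away, so there is nothing for a lazy #DNA trap to do.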


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
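
The cookie scheme above can be sketched in C. This is a minimal model, not the kernel's code: `struct sigcontext_sketch`, `sendsig_store_cookie()`, and `sigreturn_check_cookie()` are hypothetical names, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sigcontext holding only the cookie field. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* sendsig() side: bind the per-process secret to this sigcontext's address. */
static void
sendsig_store_cookie(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_cookie ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn() side: verify, then clear so the context cannot be reused. */
static int
sigreturn_check_cookie(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_cookie ^ (uint64_t)(uintptr_t)sc))
		return -1;	/* forged, relocated, or already consumed */
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address is mixed in, a sigcontext copied to a different location fails the check, and clearing the cookie makes a second sigreturn on the same context fail as well.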


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
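
The padding amounts to rounding the end of the kernel area up to the next 2MB boundary so that it can be covered by whole large pages. A minimal sketch of that arithmetic in C, where `LARGE_PAGE_SIZE` and `round_up()` are illustrative names and not the kernel's own macros:

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB large page */

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t
round_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

The gap between the real end of bss and the rounded-up address is the VA hole the entry mentions.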


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
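
The FS.base caching described in 1.45 can be sketched in plain C as below. This is a userland model only, not the kernel's assembly: the struct names, the counters, and restore_fsbase() are hypothetical stand-ins for ci_cur_fsbase, pcb_fsbase, and the return-to-userspace path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-CPU state; not the real struct cpu_info. */
struct cpu_model {
	uint64_t cur_fsbase;	/* models ci_cur_fsbase */
	unsigned wrmsr_count;	/* simulated wrmsr executions */
	unsigned segload_count;	/* simulated reloads of %fs */
};

/* Hypothetical model of the per-thread state; not the real struct pcb. */
struct pcb_model {
	uint64_t fsbase;	/* models pcb_fsbase */
};

/* Models the check made on return to userspace. */
void
restore_fsbase(struct cpu_model *ci, const struct pcb_model *pcb)
{
	assert(ci && pcb);
	if (ci->cur_fsbase == pcb->fsbase)
		return;			/* CPU already holds the right value */
	if (pcb->fsbase == 0)
		ci->segload_count++;	/* setting %fs zeroes FS.base */
	else
		ci->wrmsr_count++;	/* a nonzero base needs wrmsr */
	ci->cur_fsbase = pcb->fsbase;
}
```

Returning twice in a row to the same thread costs one simulated wrmsr, and a thread with a zero base takes the cheaper segment-load path.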


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.112 20-Jan-2019 mlarkin

Implement rdmsr_safe

rdmsr_safe is used when reading potentially missing MSRs, to avoid
triggering #GPs in the kernel.

ok guenther


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when meltdown
in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
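
The flag tracking described in 1.98 can be modeled roughly as follows. This is a simplified userland sketch, not the kernel code: struct cpu_fpu_model and both function names are hypothetical, and the actual save, blank-state load, and restore are reduced to counters.

```c
#include <assert.h>

/* Hypothetical per-CPU model: is curproc's xstate loaded in the CPU? */
struct cpu_fpu_model {
	int xstate_loaded;	/* the flag described in the commit */
	unsigned saves;		/* simulated saves to the PCB */
	unsigned restores;	/* simulated restores from the PCB */
};

/*
 * Models context switch, sendsig(), vmm, and in-kernel CPU crypto:
 * if the flag is set, save the state, clear the flag, load blank state.
 */
void
fpu_kernel_use(struct cpu_fpu_model *c)
{
	assert(c);
	if (c->xstate_loaded) {
		c->saves++;		/* save old thread's state to the PCB */
		c->xstate_loaded = 0;	/* blank state is now in the CPU */
	}
}

/* Models return to userspace: if the flag is clear, set it and restore. */
void
fpu_return_to_user(struct cpu_fpu_model *c)
{
	assert(c);
	if (!c->xstate_loaded) {
		c->xstate_loaded = 1;
		c->restores++;		/* restore the thread's state */
	}
}
```

In this model repeated kernel-side uses save at most once and repeated returns restore at most once; the xrstor fault handling the commit describes is left out.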


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
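
The cookie scheme described in 1.78 can be illustrated with a small C model. This is a sketch under assumptions, not the kernel's sendsig()/sigreturn(2): the struct, the function names, and the 64-bit cookie width are hypothetical, and the check that the syscall came from the exact sigtramp PC is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved signal context. */
struct sigcontext_model {
	uint64_t sc_cookie;	/* other saved registers elided */
};

/* Models sendsig(): store (per-process value ^ &sigcontext). */
void
model_sendsig(struct sigcontext_model *sc, uint64_t proc_cookie)
{
	assert(sc);
	sc->sc_cookie = proc_cookie ^ (uint64_t)(uintptr_t)sc;
}

/*
 * Models the cookie part of sigreturn(2): verify the cookie, then
 * clear it so the same sigcontext cannot be replayed.
 * Returns 0 on success, -1 if the cookie does not match.
 */
int
model_sigreturn(struct sigcontext_model *sc, uint64_t proc_cookie)
{
	assert(sc);
	if (sc->sc_cookie != (proc_cookie ^ (uint64_t)(uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;
	return 0;
}
```

Because the address of the sigcontext is mixed into the cookie, a context copied to a different address fails the check, and a context that has already been consumed fails on reuse.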


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
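
The skip-if-unchanged logic described above can be sketched in C. This is
only a model under stated assumptions, not the locore.S assembly: the two
structs are cut-down stand-ins that reuse the field names from the commit
message, restore_fsbase() is a hypothetical helper, and the counters merely
simulate the wrmsr and %fs-load instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-ins; only the fields named in the commit message. */
struct pcb { uint64_t pcb_fsbase; };          /* value the thread wants */
struct cpu_info { uint64_t ci_cur_fsbase; };  /* value the CPU holds now */

static unsigned wrmsr_count;     /* simulated wrmsr writes to FS.base */
static unsigned seg_load_count;  /* simulated loads of the %fs selector */

/* Hypothetical helper modelling the return-to-userspace step. */
static void
restore_fsbase(struct cpu_info *ci, const struct pcb *pcb)
{
	assert(ci != NULL && pcb != NULL);

	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */

	if (pcb->pcb_fsbase == 0)
		seg_load_count++;	/* setting %fs zeros FS.base cheaply */
	else
		wrmsr_count++;		/* a nonzero base needs the MSR write */

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

Returning twice in a row to the same thread costs one write, not two, and
a thread with a zero base never takes the wrmsr path in this model.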


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
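
A small C sketch of why the limit is the size minus one: the descriptor
table limit is the offset of the last valid byte, not a byte count. The
struct layout is the 10-byte lgdt operand for 64-bit mode; the names
region_descriptor_sketch and make_gdt_descriptor() are illustrative
assumptions, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* lgdt operand in 64-bit mode: 16-bit limit followed by 64-bit base. */
struct region_descriptor_sketch {
	uint16_t rd_limit;	/* offset of the last valid byte */
	uint64_t rd_base;	/* linear address of the table */
} __attribute__((packed));

/* Hypothetical helper: build the operand for a table of table_bytes. */
static struct region_descriptor_sketch
make_gdt_descriptor(const void *table, size_t table_bytes)
{
	struct region_descriptor_sketch rd;

	/* The limit field is 16 bits wide, so the table is at most 64KB. */
	assert(table_bytes >= 1 && table_bytes <= 65536);

	rd.rd_limit = (uint16_t)(table_bytes - 1);
	rd.rd_base = (uint64_t)(uintptr_t)table;
	return rd;
}
```

With eight 8-byte slots the table is 64 bytes long, so the limit is 63.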


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now in locore.S:
%r9 is not saved over function calls, and moreover we did not even
want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


Revision tags: OPENBSD_6_4_BASE
# 1.111 07-Oct-2018 guenther

In vmm, handle xsetbv like xrstor: instead of trying to prevalidate
the values, just try it and handle the #GP if it faults.

Problem reported by Maxime Villard (max(at)m00nbsd.net)
ok mlarkin@


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
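
The flag tracking in the two bullet points above can be modelled in C. This
is a sketch under assumptions, not the kernel code: cpu_sketch,
thread_sketch and both functions are hypothetical names, and a single int
stands in for the whole xstate area.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: one int plays the role of the xstate area. */
struct thread_sketch { int saved_xstate; };	/* copy kept in the PCB */
struct cpu_sketch {
	int xstate_loaded;	/* is curproc's xstate in the CPU? */
	int regs;		/* simulated FPU/SSE register contents */
};

/* Context switch, sendsig(), vmm, in-kernel crypto: save, then blank. */
static void
fpu_kernel_enter(struct cpu_sketch *ci, struct thread_sketch *td)
{
	assert(ci != NULL && td != NULL);
	if (ci->xstate_loaded) {
		td->saved_xstate = ci->regs;	/* save to the PCB */
		ci->xstate_loaded = 0;		/* clear the flag */
		ci->regs = 0;			/* load the blank state */
	}
}

/* Return to userspace: restore only if the state is not loaded. */
static void
fpu_return_to_user(struct cpu_sketch *ci, const struct thread_sketch *td)
{
	assert(ci != NULL && td != NULL);
	if (!ci->xstate_loaded) {
		ci->regs = td->saved_xstate;
		ci->xstate_loaded = 1;
	}
}
```

Calling fpu_kernel_enter() a second time is a no-op in this model, so the
saved copy is not overwritten by the blank state.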


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
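
The cookie part of the scheme above can be sketched in C. This is a model
under assumptions: sigcontext_sketch and both functions are hypothetical
names, the per-process secret is passed in as a plain argument, and the
check that the syscall came from the exact sigtramp PC is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the saved register state is elided. */
struct sigcontext_sketch { uint64_t sc_cookie; };

/* sendsig() side: bind the context to this process and this address. */
static void
sendsig_store_cookie(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_cookie ^ (uint64_t)(uintptr_t)sc;
}

/* sigreturn(2) side: returns 0 if accepted, -1 if the cookie is wrong. */
static int
sigreturn_check_cookie(struct sigcontext_sketch *sc, uint64_t proc_cookie)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_cookie ^ (uint64_t)(uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;	/* clear it so the context can't be reused */
	return 0;
}
```

Because the address is mixed into the cookie, a sigcontext copied to a
different location fails the check, and a context that was already
consumed fails on the second attempt.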


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK kettenis@ mlarkin@ jsg@


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
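
A minimal C sketch of the skip logic described in this entry. Only pcb_fsbase
and ci_cur_fsbase are names from the commit message; the struct layouts, the
enum, and fsbase_on_return() are hypothetical, and the real work is done in
assembly on the return-to-userspace path. The function reports which action
would be taken instead of touching any hardware.

```c
#include <stdint.h>

/* Hypothetical result codes: what the return path would have to do. */
enum fs_action {
	FS_SKIP,		/* CPU already holds the right FS.base */
	FS_RELOAD_SELECTOR,	/* target is zero: reloading %fs zeros FS.base */
	FS_WRMSR		/* nonzero target: write the FS.base MSR */
};

/* Simplified stand-ins for struct cpu_info and the PCB. */
struct cpu_model {
	uint64_t ci_cur_fsbase;	/* value currently loaded on this CPU */
};

struct pcb_model {
	uint64_t pcb_fsbase;	/* value this thread wants */
};

/* Decide how to get the thread's FS.base onto the CPU, and track it. */
static enum fs_action
fsbase_on_return(struct cpu_model *ci, const struct pcb_model *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return FS_SKIP;

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
	return pcb->pcb_fsbase == 0 ? FS_RELOAD_SELECTOR : FS_WRMSR;
}
```

Returning twice in a row to the same thread yields FS_WRMSR once and FS_SKIP
afterwards, which is the saving the entry describes.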


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok
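
The idea of keeping a per-CPU pointer to the loaded pmap can be sketched in C
as below. This is an illustrative model under stated assumptions, not the
cpu_switchto assembly: struct pmap_model, struct cpu_pmap_model, curpmap, and
switch_pmap() are hypothetical names, and the return value stands in for
"a page table reload would happen here".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pmap; only its identity matters here. */
struct pmap_model {
	int id;
};

/* Hypothetical per-CPU state: the pmap whose page tables are loaded. */
struct cpu_pmap_model {
	struct pmap_model *curpmap;
};

/*
 * Returns true if the page tables would have to be reloaded.
 * Switching to the kernel pmap leaves the previous user pmap loaded,
 * so switching back to the same process costs nothing.
 */
static bool
switch_pmap(struct cpu_pmap_model *ci, struct pmap_model *next,
    const struct pmap_model *kernel)
{
	if (next == kernel || next == ci->curpmap)
		return false;

	ci->curpmap = next;
	return true;
}
```

Because a pmap can stay loaded on a CPU after its process stops running there,
destroying it needs the forced reload that the entry introduces as a new IPI.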


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
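
For context, the limit field of an x86 descriptor table register holds the
offset of the last valid byte, not the table size. A short C sketch of the
calculation follows; gdt_limit() is a hypothetical helper written for this
note, not code from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * The limit is the offset of the last valid byte, so a table of
 * table_bytes bytes has limit table_bytes - 1.
 * table_bytes must be in the range 1..65536.
 */
static uint16_t
gdt_limit(size_t table_bytes)
{
	return (uint16_t)(table_bytes - 1);
}
```

A table of three 8-byte descriptors therefore has a limit of 23, and the
largest possible table of 65536 bytes has a limit of 0xffff.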


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.110 04-Oct-2018 guenther

Use PCIDs where they and the INVPCID instruction are available.
This uses one PCID for kernel threads, one for the U+K tables of
normal processes, one for the matching U-K tables (when the Meltdown
mitigation is in effect), and one for temporary mappings when poking other
processes. Some further tweaks are envisioned but this is good
enough to provide more separation and has (finally) been stable
under ports testing.

lots of ports testing and valid complaints from naddy@ and sthen@
feedback from mlarkin@ and sf@
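
As background, when PCIDs are enabled the low 12 bits of %cr3 carry the PCID
and the rest holds the page-aligned physical address of the top-level page
table. The C sketch below shows that composition. The enum names and the
numeric PCID values are assumptions chosen to mirror the four roles listed in
this entry; they are not taken from the kernel source.

```c
#include <stdint.h>

#define CR3_PCID_MASK	0xfffULL	/* low 12 bits of %cr3 hold the PCID */

/* Hypothetical PCID assignment, one per role named in the entry. */
enum pcid_model {
	PCID_M_KERN = 0,	/* kernel threads */
	PCID_M_PROC_UK = 1,	/* U+K tables of normal processes */
	PCID_M_PROC_U = 2,	/* matching U-K tables */
	PCID_M_TEMP = 3		/* temporary mappings of other processes */
};

/* Combine a page-aligned top-level table address with a PCID. */
static uint64_t
make_cr3(uint64_t pml4_pa, enum pcid_model pcid)
{
	return (pml4_pa & ~CR3_PCID_MASK) | (uint64_t)pcid;
}
```

With distinct PCIDs, TLB entries created under one set of tables are not used
while another set is active, which is the extra separation the entry mentions.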


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
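
The flag-based tracking described in this entry can be modelled in C roughly
as follows. This is a sketch under stated assumptions: every type and function
name is hypothetical, the register file is a plain byte array, and an all-zero
buffer stands in for the blank state (the real initial xstate is not all
zeros).

```c
#include <stdbool.h>
#include <string.h>

/* Toy register file standing in for the extended state. */
struct xstate_model {
	unsigned char regs[16];
};

/* Per-thread copy, as kept in the PCB. */
struct thread_model {
	struct xstate_model saved;
};

/* Per-CPU view: live register contents plus the "loaded" flag. */
struct cpu_fpu_model {
	struct xstate_model live;
	bool loaded;		/* is curproc's state in the registers? */
};

/* Context switch, signal delivery, vmm, or in-kernel crypto. */
static void
fpu_kernel_enter(struct cpu_fpu_model *cpu, struct thread_model *cur)
{
	if (!cpu->loaded)
		return;
	cur->saved = cpu->live;				/* save to the PCB */
	cpu->loaded = false;				/* clear the flag */
	memset(&cpu->live, 0, sizeof(cpu->live));	/* load blank state */
}

/* Return to userspace: restore the thread's state if it is not loaded. */
static void
fpu_return_to_user(struct cpu_fpu_model *cpu, const struct thread_model *cur)
{
	if (cpu->loaded)
		return;
	cpu->live = cur->saved;
	cpu->loaded = true;
}
```

In this model no trap is ever needed to load state, which matches the entry's
point that the #DNA trap can no longer happen.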


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
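
The cookie scheme in this entry can be sketched in C as below. It is an
illustrative model only, not the kernel's code: struct sigcontext_model,
sc_cookie, make_cookie(), and check_and_clear_cookie() are hypothetical names,
and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved signal context. */
struct sigcontext_model {
	uintptr_t sc_cookie;	/* filled in when the signal is delivered */
	/* saved registers would follow here */
};

/* Delivery side: bind the per-process secret to this context's address. */
static void
make_cookie(struct sigcontext_model *sc, uintptr_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/*
 * Return side: accept the context only if the cookie matches its own
 * address, then clear it so the same context cannot be replayed.
 */
static bool
check_and_clear_cookie(struct sigcontext_model *sc, uintptr_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address is mixed into the cookie, a context copied to another
location fails the check, and clearing the cookie rejects a second use of the
original.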


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
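
The padding arithmetic behind this entry can be sketched in C as follows. The
helper names and the example addresses in the note are hypothetical; the only
fact taken from the entry is that the end of the kernel area is padded up to
a 2MB boundary, which leaves a hole after the end of bss.

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* 2MB */

/* Round an address up to the next 2MB boundary (no-op if aligned). */
static uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the hole between the end of bss and the padded end. */
static uint64_t
hole_size(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

For a bss ending at 0x3ff000 the padded end is 0x400000 and the hole is one
4KB page; a bss that already ends on a 2MB boundary leaves no hole.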


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a collaboration with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
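
A minimal sketch of the "skip the setting if the CPU has the right value already" idea from the message above, written as plain C rather than the actual assembly. The field names pcb_fsbase and ci_cur_fsbase come from the commit message; the struct layouts, the counters, and the function restore_fsbase() are hypothetical stand-ins so the saving can be observed without touching real MSRs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down versions of the structures named in the commit. */
struct cpu_info_sketch {
	uint64_t ci_cur_fsbase;	/* FS.base value currently loaded on this CPU */
};

struct pcb_sketch {
	uint64_t pcb_fsbase;	/* FS.base value this thread wants */
};

/* Counters that model the two ways of changing FS.base. */
unsigned long wrmsr_count;	/* models a wrmsr to the FS.base MSR */
unsigned long fs_reload_count;	/* models reloading %fs, which zeroes FS.base */

/*
 * Called on return to userspace.  Returns true if FS.base had to change.
 * A zero base is modeled as the cheaper segment reload instead of wrmsr.
 */
bool
restore_fsbase(struct cpu_info_sketch *ci, const struct pcb_sketch *pcb)
{
	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return false;		/* CPU already has the right value */

	if (pcb->pcb_fsbase == 0)
		fs_reload_count++;
	else
		wrmsr_count++;

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
	return true;
}
```

Returning twice to the same thread costs one modeled wrmsr, not two; a thread with a zero base only bumps the segment-reload counter.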


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
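
For context, the limit field of an x86 descriptor-table register holds the offset of the last valid byte, not the byte count, which is why "size - 1" is the right value. The helper below is a hypothetical illustration of that arithmetic only; gdt_limit() is not a function from the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit for a descriptor table that occupies
 * table_bytes bytes.  The limit is the offset of the last valid byte,
 * so a full 64 KiB table has limit 0xFFFF.
 */
uint16_t
gdt_limit(size_t table_bytes)
{
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

With three 8-byte descriptors the table is 24 bytes long and the limit is 23; using 24 would expose one byte past the end of the table.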


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.109 12-Sep-2018 guenther

Now that the pmap is more paranoid about some shootdowns (pmap.c rev 1.119),
avoid some TLB flushes by not reloading %cr3 when the value isn't changing.

original diff by and ok mlarkin@


# 1.108 09-Sep-2018 guenther

Calculate automatically the padding necessary for lining up the
iretq instruction used when Meltdown mitigation is in effect. It got
pushed off when an lfence was added in locore.S rev 1.107, resulting
in two signals being sent instead of one when iretq faulted, and
neither signal had the correct sigcontext info. Update the makefile
rule for locore.o to verify that things are correct.

ok mlarkin@


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
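
The two bullet rules of the semi-eager scheme above can be sketched as a tiny state machine. This is an illustrative model only: the struct names, the integer standing in for the extended state, and both function names are hypothetical, and the real code uses fxsave/xsave and fxrstor/xrstor rather than assignments.

```c
#include <stdbool.h>

#define XSTATE_BLANK 0	/* stand-in for the initial ("blank") extended state */

struct thread_sketch {
	int saved_xstate;	/* models the xstate save area in the PCB */
};

struct cpu_sketch {
	int live_xstate;	/* models the FPU/SSE register contents */
	bool xstate_loaded;	/* is curproc's xstate in the registers? */
};

/*
 * Context switch, sendsig(), vmm, kernel crypto: if the flag is set,
 * save the old thread's state, clear the flag, load the blank state.
 */
void
fpu_kernel_uses_cpu(struct cpu_sketch *cpu, struct thread_sketch *cur)
{
	if (cpu->xstate_loaded) {
		cur->saved_xstate = cpu->live_xstate;
		cpu->xstate_loaded = false;
		cpu->live_xstate = XSTATE_BLANK;
	}
}

/* Return to userspace: if the flag is clear, set it and restore. */
void
fpu_return_to_user(struct cpu_sketch *cpu, const struct thread_sketch *cur)
{
	if (!cpu->xstate_loaded) {
		cpu->xstate_loaded = true;
		cpu->live_xstate = cur->saved_xstate;
	}
}
```

In this model no thread's state is ever left in the registers while another thread or the kernel runs, which matches the motivation about not leaking FPU state across protection boundaries.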


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
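
A rough sketch of the cookie part of this mitigation, assuming a 64-bit per-process secret. The struct, the field name sc_cookie, and both function names are hypothetical placeholders; the check that the syscall came from the exact sigtramp PC is not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down signal context: only the cookie is modeled. */
struct sigcontext_sketch {
	uint64_t sc_cookie;
};

/* sendsig() side: bind the context to this process and this address. */
void
sendsig_store_cookie(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	sc->sc_cookie = proc_secret ^ (uint64_t)(uintptr_t)sc;
}

/*
 * sigreturn() side: accept the context only if the cookie matches,
 * then clear it so the same context cannot be replayed.
 */
bool
sigreturn_check_cookie(struct sigcontext_sketch *sc, uint64_t proc_secret)
{
	if (sc->sc_cookie != (proc_secret ^ (uint64_t)(uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;
	return true;
}
```

Because the address of the context is mixed into the cookie, a context copied to a different location fails the check, and clearing the cookie makes a second sigreturn on the same context fail as well.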


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
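
As an illustration of the test being made consistent here: on x86 the requested privilege level is the low two bits of a segment selector, so masking with 3 and comparing against the user level tells kernel from user. The macro values below follow that architectural layout; the function selector_is_user() and the example selector values are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_RPL	3	/* mask for the requested privilege level bits */
#define SEL_KPL	0	/* kernel privilege level */
#define SEL_UPL	3	/* user privilege level */

/* True if the selector's privilege level is the user level. */
bool
selector_is_user(uint16_t sel)
{
	return (sel & SEL_RPL) == SEL_UPL;
}
```

A selector such as 0x23 has its low two bits set and is treated as user, while 0x08 is treated as kernel.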


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
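
The padding described above amounts to rounding the end of the kernel area up to the next 2MB boundary; the gap between the end of bss and that boundary is the VA hole the message mentions. A small sketch of the arithmetic, with hypothetical helper names and example addresses:

```c
#include <stdint.h>

#define LARGE_PAGE_SIZE	UINT64_C(0x200000)	/* 2MB */

/* Round an address up to the next 2MB boundary (no-op if aligned). */
uint64_t
round_up_2mb(uint64_t addr)
{
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Size of the hole between the end of bss and the padded end. */
uint64_t
hole_size(uint64_t end_of_bss)
{
	return round_up_2mb(end_of_bss) - end_of_bss;
}
```

For example, a bss ending at 0x3ff000 is padded to 0x400000, leaving a 0x1000-byte hole.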


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
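
The skip-if-unchanged logic described above can be sketched in C roughly as follows. This is only an illustration, not the committed assembly: the struct layouts and the sketch_restore_fsbase() helper are hypothetical, and the two counters merely stand in for the real wrmsr and %fs selector load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCB and cpu_info fields named in the log. */
struct sketch_pcb {
	uint64_t pcb_fsbase;	/* FS.base the thread wants */
};

struct sketch_cpu {
	uint64_t ci_cur_fsbase;	/* FS.base the CPU currently holds */
	unsigned wrmsr_count;	/* simulated wrmsr operations */
	unsigned segload_count;	/* simulated %fs selector loads */
};

/* Called on the (simulated) return-to-userspace path. */
static void
sketch_restore_fsbase(struct sketch_cpu *ci, const struct sketch_pcb *pcb)
{
	assert(ci != NULL && pcb != NULL);

	if (ci->ci_cur_fsbase == pcb->pcb_fsbase)
		return;			/* CPU already has the right value */

	if (pcb->pcb_fsbase == 0)
		ci->segload_count++;	/* loading %fs zeros FS.base cheaply */
	else
		ci->wrmsr_count++;	/* otherwise pay for the wrmsr */

	ci->ci_cur_fsbase = pcb->pcb_fsbase;
}
```

In this model a process without TLS never triggers the wrmsr path, which is the optimization the message describes.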


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock() and acpi_release_global_lock() to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.107 24-Jul-2018 guenther

Also do RSB refilling when context switching, after vmexits, and
when vmlaunch or vmresume fails.

Follow the lead of clang and the intel recommendation and do an lfence
after the pause in the speculation-stop path for retpoline, RSB refill,
and meltover ASM bits.

ok kettenis@ deraadt@


# 1.106 23-Jul-2018 guenther

Do "Return stack refilling", based on the "Return stack underflow" discussion
and its associated appendix at https://support.google.com/faqs/answer/7625886
This should address at least some cases of "SpectreRSB" and earlier
Spectre variants; more commits to follow.

The refilling is done in the enter-kernel-from-userspace and
return-to-userspace-from-kernel paths, making sure to do it before
unblocking interrupts so that a successive interrupt can't get the
CPU to C code without doing this refill. Per the link above, it
also does it immediately after mwait, apparently in case the low-power
CPU states of idle-via-mwait flush the RSB.

ok mlarkin@ deraadt@


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
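
The flag tracking described in the two bullets above can be modelled in C along these lines. This is a minimal sketch under stated assumptions: the struct names, the four-word "register file" and both helper functions are hypothetical, and the blank state is modelled as all zeros, which the real xstate is not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_XSTATE_WORDS 4

struct sketch_xstate {
	unsigned long w[SKETCH_XSTATE_WORDS];
};

struct sketch_thread {
	struct sketch_xstate pcb_savefpu;	/* saved copy in the PCB */
};

struct sketch_cpu {
	struct sketch_xstate regs;	/* simulated live FPU registers */
	int xstate_loaded;		/* is curproc's xstate in regs? */
};

/* Context switch, sendsig(), vmm and in-kernel crypto would call this. */
static void
sketch_fpu_kernel_enter(struct sketch_cpu *ci, struct sketch_thread *cur)
{
	assert(ci != NULL && cur != NULL);

	if (ci->xstate_loaded) {
		cur->pcb_savefpu = ci->regs;		/* save to the PCB */
		ci->xstate_loaded = 0;			/* clear the flag */
		memset(&ci->regs, 0, sizeof(ci->regs));	/* load blank state */
	}
}

/* Called on the (simulated) return-to-userspace path. */
static void
sketch_fpu_return_to_user(struct sketch_cpu *ci,
    const struct sketch_thread *cur)
{
	assert(ci != NULL && cur != NULL);

	if (!ci->xstate_loaded) {
		ci->xstate_loaded = 1;		/* set the flag */
		ci->regs = cur->pcb_savefpu;	/* restore the thread's state */
	}
}
```

Because the registers hold either the current thread's state or the blank state, nothing in this model depends on a trap to load state lazily.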


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
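
The cookie scheme described above can be sketched in C as below. This is an illustration only: the one-field sigcontext, the function names and the explicit PC arguments are hypothetical simplifications, not the kernel's actual sendsig() or sigreturn(2) code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down sigcontext holding only the cookie. */
struct sketch_sigcontext {
	uintptr_t sc_cookie;
};

/* Signal delivery: bind the context to this process and this address. */
static void
sketch_sendsig(struct sketch_sigcontext *sc, uintptr_t proc_cookie)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_cookie ^ (uintptr_t)sc;
}

/* Returns 0 on success, -1 if the request must be rejected. */
static int
sketch_sigreturn(struct sketch_sigcontext *sc, uintptr_t proc_cookie,
    uintptr_t entry_pc, uintptr_t sigtramp_pc)
{
	assert(sc != NULL);

	if (entry_pc != sigtramp_pc)
		return -1;	/* not entered from the sigtramp */
	if (sc->sc_cookie != (proc_cookie ^ (uintptr_t)sc))
		return -1;	/* forged, copied or already used */

	sc->sc_cookie = 0;	/* clear it so the context cannot be reused */
	return 0;
}
```

Since the address of the sigcontext is mixed into the cookie, a context copied to another location fails the check, and clearing the cookie after use rejects a replay unless the per-process value happens to equal the address.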


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
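
The padding arithmetic implied above can be sketched in C as follows. The helper names and the SKETCH_LARGE_PAGE constant are hypothetical; the only thing taken from the message is that the end of the kernel area is rounded up to a 2MB boundary, leaving a hole after the end of bss.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LARGE_PAGE (2ULL * 1024 * 1024)	/* 2MB */

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t
sketch_round_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Size of the VA hole between the end of bss and the padded end. */
static uint64_t
sketch_hole_size(uint64_t end_of_bss)
{
	return sketch_round_up(end_of_bss, SKETCH_LARGE_PAGE) - end_of_bss;
}
```

For example, a bss ending at 0x1234567 pads up to 0x1400000, and an address that is already 2MB-aligned is left unchanged with a hole of zero bytes.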


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
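
A minimal sketch of the caching idea described above, not the actual locore.S code. The struct and function names are hypothetical, and counters stand in for the real wrmsr and %fs reload so the logic can be exercised in userland:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU state (ci_cur_fsbase in the log). */
struct sketch_cpu {
	uint64_t cur_fsbase;	/* value the CPU currently holds in FS.base */
	unsigned wrmsr_count;	/* simulated expensive wrmsr writes */
	unsigned segload_count;	/* simulated cheap %fs reloads */
};

/*
 * Called on return to userspace with the thread's wanted value
 * (pcb_fsbase in the log). Skips the update when the CPU already
 * holds the right value, and uses the cheaper segment reload when
 * the wanted value is zero.
 */
static void
sketch_restore_fsbase(struct sketch_cpu *ci, uint64_t pcb_fsbase)
{
	assert(ci != NULL);
	if (ci->cur_fsbase == pcb_fsbase)
		return;			/* nothing to do */
	if (pcb_fsbase == 0)
		ci->segload_count++;	/* stands in for reloading %fs */
	else
		ci->wrmsr_count++;	/* stands in for wrmsr */
	ci->cur_fsbase = pcb_fsbase;
}
```

Two consecutive returns to userspace with the same nonzero base cost one simulated wrmsr, and a process that never sets FS.base costs none.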


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
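
For illustration only: the descriptor-table limit field holds the offset of the last valid byte, not the byte count, which is why the fix subtracts one. The helper name below is hypothetical and is not taken from the tree:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit limit for a descriptor table that occupies
 * table_bytes bytes. The limit is the last valid offset, so a
 * table of N bytes has limit N - 1.
 */
static uint16_t
sketch_gdt_limit(size_t table_bytes)
{
	/* The limit field is 16 bits wide, so at most 64KB of table. */
	assert(table_bytes >= 1 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```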


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.105 12-Jul-2018 guenther

Reorganize the Meltdown entry and exit trampolines for syscall and
traps so that the "mov %rax,%cr3" is followed by an infinite loop
which is avoided because the mapping of the code being executed is
changed. This means the sysretq/iretq isn't even present in that
flow of instructions in the kernel mapping, so userspace code can't
be speculatively reached on the kernel mapping and totally eliminates
the conditional jump over the %cr3 change that supported CPUs
without the Meltdown vulnerability. The return paths were probably
vulnerable to Spectre v1 (and v1.1/1.2) style attacks, speculatively
executing user code post-system-call with the kernel mappings, thus
creating cache/TLB/etc side-effects.

Would like to apply this technique to the interrupt stubs too, but
I'm hitting a bug in clang's assembler which misaligns the code and
symbols.

While here, when on a CPU not vulnerable to Meltdown, codepatch out
the unnecessary bits in cpu_switchto().

Inspiration from sf@, refined over dinner with theo
ok mlarkin@ deraadt@


# 1.104 10-Jul-2018 deraadt

In asm.h ensure NENTRY uses the old-school nop-sled align, but change standard
ENTRY to a trapsled. Fix a few functions which fall-through into an ENTRY
macro. amd64 binaries now are free of double+-nop sequences (except for one
assembler nit in aes-586.pl). Previous changes by guenther got us here.
ok mortimer kettenis


# 1.103 03-Jul-2018 mortimer

Add retguard macros for kernel asm.
ok deraadt, ok mlarkin (vmm_support)


# 1.102 01-Jul-2018 guenther

Provide _ALIGN_TRAPS macro for text alignment with a trap-sled, then
use it where that was manually written before. No binary change.

ok deraadt@


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
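
A rough sketch of the cookie scheme described above, under simplifying assumptions: the struct, the function names, and the way the per-process secret is passed in are all hypothetical, and the check that the syscall came from the exact sigtramp PC is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the real sigcontext. */
struct sketch_sigcontext {
	uintptr_t sc_cookie;
};

/* Signal delivery: bind the context to this process and this address. */
static void
sketch_sendsig(struct sketch_sigcontext *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	sc->sc_cookie = proc_secret ^ (uintptr_t)sc;
}

/*
 * sigreturn: returns 1 if the context is accepted, 0 if rejected.
 * An accepted context has its cookie cleared so it cannot be reused.
 */
static int
sketch_sigreturn(struct sketch_sigcontext *sc, uintptr_t proc_secret)
{
	assert(sc != NULL);
	if (sc->sc_cookie != (proc_secret ^ (uintptr_t)sc))
		return 0;	/* forged, relocated, or already used */
	sc->sc_cookie = 0;
	return 1;
}
```

Because the address of the sigcontext is mixed into the cookie, a context copied to another location fails the check, as does one replayed after a successful return.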


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
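
The padding step amounts to rounding the end of the kernel area up to the next 2MB boundary so that whole large pages in the direct map can lose write permission. A sketch of that arithmetic, with hypothetical names that do not come from the tree:

```c
#include <assert.h>
#include <stdint.h>

/* Size of an amd64 large page: 2MB. */
#define SKETCH_LARGE_PAGE	((uint64_t)2 << 20)

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
static uint64_t
sketch_round_2mb(uint64_t addr)
{
	/* Guard against wraparound in the addition below. */
	assert(addr <= UINT64_MAX - (SKETCH_LARGE_PAGE - 1));
	return (addr + SKETCH_LARGE_PAGE - 1) & ~(SKETCH_LARGE_PAGE - 1);
}
```

The gap between the real end of bss and the rounded address is the "VA hole" the message refers to.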


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.101 14-Jun-2018 guenther

Clear the GPRs when entering the kernel from userspace so that
user-controlled values can't take part in speculative execution in
the kernel down paths that end up "not taken" but that may cause
user-visible effects (cache, etc).

prodded by dragonflybsd commit 9474cbef7fcb61cd268019694d94db6a75af7dbe
ok deraadt@ kettenis@


# 1.100 09-Jun-2018 guenther

Move all the DDBPROF logic into the trap03 (#BP) handler to keep alltraps
and intr_fast_exit clean

ok mpi@


# 1.99 07-Jun-2018 guenther

Apply the retpoline transformation to indirect jumps in the raw ASM

ok mlarkin@ mortimer@ deraadt@


# 1.98 05-Jun-2018 guenther

Switch from lazy FPU switching to semi-eager FPU switching: track whether
curproc's xstate ("extended state") is loaded in the CPU or not.
- context switch, sendsig(), vmm, and doing CPU crypto in the kernel all
check the flag and, if set, save the old thread's state to the PCB,
clear the flag, and then load the _blank_ state
- when returning to userspace, if the flag is clear then set it and restore
the thread's state

This simpler tracking also fixes the restoring of FPU state after nested
signal handlers.

With this, %cr0's TS flag is never set, the FPU #DNA trap can no
longer happen, and IPIs are no longer necessary for flushing or
syncing FPU state; on the other hand, restoring xstate while returning
to userspace means we have to handle xrstor faulting if we could
be loading an altered state. If that happens, reset the state,
fake a #GP fault (SIGBUS), and recheck for ASTs.

While here, regularize fxsave/fxrstor vs xsave/xrstor handling, by
using codepatching to switch to xsave/xrstor when present in the
CPU. In addition, code patch in use of xsaveopt in most places
when the CPU supports that. Use the 64bit-wide variants of the
instructions in all cases so that x87 instruction fault IPs are
reported correctly.

This change has three motivations:
1) with modern clang, SSE registers are used even in rcrt0.o, making
lazy FPU switching a smaller benefit vs trap costs
2) the Intel SDM warns that lazy FPU switching may increase power costs
3) post-Spectre rumors suggest that the %cr0 TS flag might not block
speculation, permitting leaking of information about FPU state
(AES keys?) across protection boundaries.

tested by many in snaps; prodding from deraadt@
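
The flag tracking described above can be sketched in C as a small state machine. This is a minimal model, not the kernel's code: every name here (struct xstate_sketch, fpu_kernel_enter(), fpu_user_return(), and so on) is hypothetical, a byte array stands in for the real extended state, and plain struct copies stand in for xsave/xrstor.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the CPU's extended (FPU/SSE) state. */
struct xstate_sketch { unsigned char regs[16]; };

struct fpu_thread_sketch { struct xstate_sketch saved; };	/* PCB copy */

struct fpu_cpu_sketch {
	struct xstate_sketch live;	/* what the CPU registers hold */
	bool user_loaded;		/* is curproc's xstate loaded? */
};

static const struct xstate_sketch blank_state = { { 0 } };

/* Context switch, sendsig(), vmm, kernel crypto: if the flag is set,
 * save to the PCB, clear the flag, then load the blank state. */
static void
fpu_kernel_enter(struct fpu_cpu_sketch *ci, struct fpu_thread_sketch *td)
{
	if (ci->user_loaded) {
		td->saved = ci->live;
		ci->user_loaded = false;
		ci->live = blank_state;
	}
}

/* Return to userspace: if the flag is clear, set it and restore. */
static void
fpu_user_return(struct fpu_cpu_sketch *ci, const struct fpu_thread_sketch *td)
{
	if (!ci->user_loaded) {
		ci->live = td->saved;
		ci->user_loaded = true;
	}
}
```

Because the state is always either saved in the PCB or live on the CPU that is about to return to userspace, no trap or IPI is needed to find it, which matches the commit's claim that the #DNA trap and FPU IPIs go away.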


# 1.97 05-Jun-2018 guenther

Split "return to userspace via iretq" from intr_fast_exit into intr_user_exit.
Move AST handling from the bottom of alltraps and Xdoreti to the
top of the new routine.
syscall-return-via-iretq and the FPU #DNA trap jump into intr_user_exit after
the AST check (already performed for the former, skipped for the latter)
Delete a couple debugging hooks mlarkin@ and I used during Meltdown work

tested by many in snaps; thanks to brynet@ for spurious interrupt testing
earlier reviews and comments kettenis@ mlarkin@; prodding from deraadt@


# 1.96 20-May-2018 guenther

Stash the syscall number in tf_err so it can be reported by the SPL check

ok mlarkin@ mpi@


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

branches: 1.94.2;
Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
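
The cookie scheme described above can be sketched in C. This is a simplified illustration under stated assumptions: struct sigcontext_sketch, sketch_sendsig(), and sketch_sigreturn() are hypothetical names, the per-process value is passed in as a plain argument, and the check that the syscall came from the exact sigtramp PC is not modeled.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the real sigcontext. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;	/* filled in when the signal is delivered */
};

/* Delivery side: bind the cookie to this sigcontext's own address. */
static void
sketch_sendsig(struct sigcontext_sketch *sc, uintptr_t proc_cookie)
{
	sc->sc_cookie = proc_cookie ^ (uintptr_t)sc;
}

/* Return side: 0 if the cookie matches, -1 if the context was forged,
 * copied to another address, or already consumed. */
static int
sketch_sigreturn(struct sigcontext_sketch *sc, uintptr_t proc_cookie)
{
	if (sc->sc_cookie != (proc_cookie ^ (uintptr_t)sc))
		return -1;
	sc->sc_cookie = 0;	/* clear it to prevent sigcontext reuse */
	return 0;
}
```

Mixing in the address means a context copied elsewhere on the stack fails the check, and clearing the cookie means the same context cannot be replayed.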


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrawise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until ht is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be mislead the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock() and acpi_release_global_lock() to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.95 26-Apr-2018 guenther

Prefer leaq+%rip-relative over movabsq
xrstor_resume must not have profile prologue, so use NENTRY
Don't use _C_LABEL() with some pure-ASM labels


Revision tags: OPENBSD_6_3_BASE
# 1.94 21-Feb-2018 guenther

Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

branches: 1.89.2;
Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis
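
The cookie scheme described in 1.78 can be illustrated with a minimal sketch. This is not the kernel's code: the struct, the function names and the way the per-process secret is passed in are all hypothetical, and the check that the syscall came from the exact sigtramp PC is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real sigcontext; saved registers omitted. */
struct sigcontext_sketch {
	uintptr_t sc_cookie;
};

/* On signal delivery: bind the per-process secret to this context's address. */
static void
sketch_sendsig(struct sigcontext_sketch *sc, uintptr_t proc_cookie)
{
	sc->sc_cookie = proc_cookie ^ (uintptr_t)sc;
}

/* On sigreturn: accept the context only if the cookie matches, then clear it. */
static bool
sketch_sigreturn(struct sigcontext_sketch *sc, uintptr_t proc_cookie)
{
	if (sc->sc_cookie != (proc_cookie ^ (uintptr_t)sc))
		return false;
	sc->sc_cookie = 0;	/* a second sigreturn on the same context fails */
	return true;
}
```

Because the address of the sigcontext is mixed into the cookie, a context copied to another address does not verify, and clearing the cookie after use prevents replaying the same context.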


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level
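
As an illustration of the test mentioned in 1.72: on x86 the low two bits of a segment selector hold the requested privilege level, so masking with 3 and comparing against 3 identifies a user-mode selector. The macro and function names below are hypothetical stand-ins, not copied from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SEL_RPL	3	/* mask for the two privilege-level bits */
#define SKETCH_SEL_KPL	0	/* kernel privilege level */
#define SKETCH_SEL_UPL	3	/* user privilege level */

/* True if the saved %cs selector says the trap came from user mode. */
static bool
sketch_from_usermode(uint16_t cs)
{
	return (cs & SKETCH_SEL_RPL) == SKETCH_SEL_UPL;
}
```

Using one named mask everywhere, rather than a mix of the mask and the user-level constant (which happen to have the same value), keeps the intent of each test readable.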


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK kettenis@ mlarkin@ jsg@


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
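
The padding in 1.61 amounts to rounding the end of the kernel up to the next 2MB boundary so that whole large pages can have their permissions changed. A small sketch of that arithmetic, with hypothetical names:

```c
#include <stdint.h>

#define SKETCH_LARGE_PAGE	((uint64_t)2 << 20)	/* 2MB */

/* Round addr up to the next 2MB boundary (unchanged if already aligned). */
static uint64_t
sketch_round_2mb(uint64_t addr)
{
	return (addr + SKETCH_LARGE_PAGE - 1) & ~(SKETCH_LARGE_PAGE - 1);
}
```

The gap between the real end of bss and the rounded address is the "VA hole" the commit message refers to.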


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperative work with jsg@. The fixed-function
performance counter read code comes from a former diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use the 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
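
The skip-if-unchanged logic in 1.45 can be sketched as follows. The struct and its counters are hypothetical; the counters only stand in for the two hardware operations (reloading %fs, or a wrmsr to the FS.base MSR) so the decision logic can be run in userland.

```c
#include <stdint.h>

/* Hypothetical per-CPU state; the counters replace real hardware writes. */
struct cpu_fs_sketch {
	uint64_t cur_fsbase;	/* value the CPU currently has loaded */
	unsigned segloads;	/* times %fs would have been reloaded */
	unsigned wrmsrs;	/* times wrmsr would have been executed */
};

/* Called on return to userspace with the thread's wanted FS.base. */
static void
sketch_set_fsbase(struct cpu_fs_sketch *ci, uint64_t want)
{
	if (ci->cur_fsbase == want)
		return;			/* CPU already has the right value */
	if (want == 0)
		ci->segloads++;		/* loading %fs zeroes FS.base cheaply */
	else
		ci->wrmsrs++;		/* nonzero base needs the MSR write */
	ci->cur_fsbase = want;
}
```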


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok art@


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok
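
The optimization described in 1.33 can be sketched like this. All names are hypothetical, the reload counter stands in for actually loading a new page table base, and the sketch assumes the kernel's mappings are reachable from whichever user pmap is loaded, which is what makes leaving it in place safe.

```c
#include <stddef.h>

struct pmap_sketch {
	int id;
};

/* Hypothetical per-CPU state: which pmap is loaded, and how often we reloaded. */
struct cpu_pmap_sketch {
	struct pmap_sketch *curpmap;
	unsigned reloads;
};

static struct pmap_sketch sketch_kernel_pmap;

/* Context switch to a thread that runs on pmap 'next'. */
static void
sketch_switch_pmap(struct cpu_pmap_sketch *ci, struct pmap_sketch *next)
{
	if (next == &sketch_kernel_pmap)
		return;		/* idle/kernel thread: keep the old pmap loaded */
	if (ci->curpmap == next)
		return;		/* back to the same process: nothing to reload */
	ci->curpmap = next;
	ci->reloads++;
}
```

The cost of this scheme is that a pmap can stay loaded on a CPU after its process stopped running there, which is why the commit also needs an IPI to force a reload before a pmap is destroyed.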


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
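
For context on 1.31: the limit field that lgdt reads is the offset of the last valid byte of the table, not its length, hence "size - 1". A minimal sketch with hypothetical type and function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte descriptor slot; real entries carry bit fields. */
struct gdt_slot_sketch {
	uint64_t raw;
};

/* Pseudo-descriptor in the 10-byte layout lgdt reads in long mode. */
struct gdt_region_sketch {
	uint16_t limit;		/* offset of the last valid byte */
	uint64_t base;		/* linear address of the table */
} __attribute__((packed));

/* Build the pseudo-descriptor for a table of nslots entries (nslots >= 1). */
static struct gdt_region_sketch
sketch_gdt_region(const struct gdt_slot_sketch *table, size_t nslots)
{
	struct gdt_region_sketch rd;

	rd.limit = (uint16_t)(nslots * sizeof(struct gdt_slot_sketch) - 1);
	rd.base = (uint64_t)(uintptr_t)table;
	return rd;
}
```

With four 8-byte slots the table is 32 bytes long and the limit is 31; using 32 would tell the CPU that one byte past the table is also valid.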


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock() and acpi_release_global_lock() to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
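
To illustrate the TAILQ point in 1.24: with the <sys/queue.h> TAILQ macros, removing an element is a constant-time operation, so waking every sleeper on a channel is a single linear pass. The struct and function below are a hypothetical sketch, not the kernel's wakeup().

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeping thread: the channel it waits on plus a queue link. */
struct sleeper_sketch {
	const void *wchan;
	int awake;
	TAILQ_ENTRY(sleeper_sketch) link;
};
TAILQ_HEAD(sleepq_sketch, sleeper_sketch);

/* Wake every sleeper waiting on chan; returns how many were woken. */
static int
sketch_wakeup(struct sleepq_sketch *q, const void *chan)
{
	struct sleeper_sketch *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking s */
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* O(1) unlink */
		s->awake = 1;
		n++;
	}
	return n;
}
```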


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)


# 1.94 21-Feb-2018 guenther

Meltdown: implement user/kernel page table separation.

On Intel CPUs which speculate past user/supervisor page permission checks,
use a separate page table for userspace with only the minimum of kernel code
and data required for the transitions to/from the kernel (still marked as
supervisor-only, of course):
- the IDT (RO)
- three pages of kernel text in the .kutext section for interrupt, trap,
and syscall trampoline code (RX)
- one page of kernel data in the .kudata section for TLB flush IPIs (RW)
- the lapic page (RW, uncachable)
- per CPU: one page for the TSS+GDT (RO) and one page for trampoline
stacks (RW)

When a syscall, trap, or interrupt takes a CPU from userspace to kernel the
trampoline code switches page tables, switches stacks to the thread's real
kernel stack, then copies over the necessary bits from the trampoline stack.
On return to userspace the opposite occurs: recreate the iretq frame on the
trampoline stack, switch stack, switch page tables, and return to userspace.

mlarkin@ implemented the pmap bits and did 90% of the debugging, diagnosing
issues on MP in particular, and drove the final push to completion.
Many rounds of testing by naddy@, sthen@, and others
Thanks to Alex Wilson from Joyent for early discussions about trampolines
and their data requirements.
Per-CPU page layout mostly inspired by DragonFlyBSD.

ok mlarkin@ deraadt@


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove an unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularly,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly built bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, needs to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore; information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: resetting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a joint effort with jsg@. The fixed-function
performance counter read code comes from an earlier diff of his.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc., just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: locore.S
%r9 are not saved over function calls
and more we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.93 07-Jan-2018 mlarkin

remove all PG_G global page mappings from the kernel when running on
Intel CPUs. Part of an ongoing set of commits to mitigate the Intel
"meltdown" CVE. This diff does not confer any immunity to that
vulnerability - subsequent commits are still needed and are being
worked on presently.

ok guenther, deraadt


# 1.92 06-Jan-2018 guenther

Handle %gs like %[def]s and reset set it in cpu_switchto() instead of on
every return to userspace.

ok kettenis@ mlarkin@


# 1.91 10-Oct-2017 mlarkin

remove a unused variable

ok tom, kettenis, deraadt


# 1.90 05-Oct-2017 mlarkin

Clean up some no longer needed includes left over from the locore/locore0 split.

ok tom, mpi, deraadt


Revision tags: OPENBSD_6_2_BASE
# 1.89 04-Oct-2017 guenther

Follow the pattern set by copy*/pcb_onfault: when xrstor faults, return
from the trap to a 'resume' address to effectively make xrstor_user()
return an error indication, then do the FPU cleanup and trap generation
from there where we can get access to the original, userspace trapframe.

The original fix tried to handle the trap while on the wrong trapframe,
leaking kernel addresses and possibly leading to double faults.
Problem pointed out by abluhm@
ok deraadt@ mikeb@


# 1.88 03-Oct-2017 guenther

The xrstor instruction will fault if the provided xstate data, which
is under userspace control via sigreturn, fails various consistency
checks. Rather than trying to replicate the CPU's hardwired checks
in C code, handle it like iretq: check in trap() whether a fault
is from the problem instruction and handle it there.

CPU behavior and the potential issue pointed out on Linux kernel-hardening
ok mikeb@ deraadt@


# 1.87 06-Jul-2017 deraadt

0xcc-fill a few more alignments. Not because these ones matter particularily,
but because elimination highlights more important ones.
Cursory review mortimer, ok mlarkin


# 1.86 29-Jun-2017 deraadt

Put asm-generated strings into .rodata
ok millert


# 1.85 31-May-2017 deraadt

Split early startup code out of locore.S into locore0.S. Adjust link
run so that this locore0.o is always at the start of the executable.
But randomize the link order of all other .o files in the kernel, so
that their exec/rodata/data/bss segments land all over the place.
Late during kernel boot, unmap the early startup code.

As a result, the internal layout of every newly build bsd kernel is
different from past kernels. Internal relative offsets are not known
to an outside attacker. The only known offsets are in the startup code,
which has been unmapped.

Ramdisk kernels cannot be compiled like this, because they are gzip'd.
When the internal pointer references change, the compression dictionary
bloats and results in poorer compression.

ok kettenis mlarkin visa, also thanks to tedu for getting me back to this


Revision tags: OPENBSD_6_1_BASE
# 1.84 06-Feb-2017 mpi

branches: 1.84.4;
Sync a comment with i386.


# 1.83 04-Sep-2016 mpi

Introduce Dynamic Profiling, a ddb(4) based & gprof compatible kernel
profiling framework.

Code patching is used to enable probes when entering functions. The
probes will call a mcount()-like function to match the behavior of a
GPROF kernel.

Currently only available on amd64 and guarded under DDBPROF. Support
for other archs will follow soon.

A new sysctl knob, ddb.console, need to be set to 1 in securelevel 0
to be able to use this feature.

Inputs and ok guenther@


Revision tags: OPENBSD_6_0_BASE
# 1.82 16-Jul-2016 mlarkin

branches: 1.82.2;

remove some unused #includes


# 1.81 22-Jun-2016 mikeb

Setup Hyper-V hypercall page and an IDT vector.

ok mlarkin, kettenis, deraadt


# 1.80 06-Jun-2016 deraadt

Fill a few more pads with 0xcc
ok mikeb, mlarkin


# 1.79 23-May-2016 deraadt

Place a cpu-dependent trap/illegal instruction over the remainder of the
sigtramp page, so that it will generate a nice kernel fault if touched.
While here, move most of the sigtramps to the .rodata segment, because
they are not executed in the kernel.
Also some preparation for sliding the actual sigtramp forward (will need
some gdb changes)
ok mlarkin kettenis


# 1.78 10-May-2016 deraadt

SROP mitigation. sendsig() stores a (per-process ^ &sigcontext) cookie
inside the sigcontext. sigreturn(2) checks syscall entry was from the
exact PC addr in the (per-process ASLR) sigtramp, verifies the cookie,
and clears it to prevent sigcontext reuse.
not yet tested on landisk, sparc, *88k, socppc.
ok kettenis


# 1.77 10-May-2016 mikeb

Fill Xen hypercall page with int3's like the hypervisor does.

Idea from deraadt@ and mlarkin@.


# 1.76 26-Feb-2016 mlarkin

SYMTAB_SPACE is no longer used (last used with a.out ddb)


Revision tags: OPENBSD_5_9_BASE
# 1.75 04-Jan-2016 mlarkin

wrap a long line


# 1.74 08-Dec-2015 mikeb

Setup a hypercall page in the kernel .text segment

Its location will be communicated with the Xen hypervisor
that will fill it in with instructions resulting in VMEXIT
events.

Discussed with kettenis@ and deraadt@, with input from and
OK mpi, mlarkin, reyk


# 1.73 09-Nov-2015 mlarkin

Cache the result of cpuid leaf function $0x1 from the host's boot CPU
during locore, information based on this will be returned to guest VMs
issuing cpuid instructions later, under certain circumstances.


Revision tags: OPENBSD_5_8_BASE
# 1.72 17-Jul-2015 guenther

Consistently use SEL_RPL as the mask when testing selector privilege level


# 1.71 17-Jul-2015 mlarkin

"are we 386, 386sx, or 486, or Pentium, or.."

I'm pretty sure the amd64 kernel won't boot on any of those CPUs, so
delete the (unused) variable that was supposed to track which 32 bit
CPU we were running on.


# 1.70 16-Jul-2015 mlarkin

remove 'cpu_brand_id' as we no longer use that method to calculate the
name of the cpu. Further, the calculation of cpu_brand_id was in the
wrong place to begin with, so it was being calculated incorrectly anyway.


# 1.69 16-Jul-2015 mlarkin

Fix a backward compare in boot argument parsing, and clarify a comment that
was wrong.

ok guenther@


# 1.68 28-Jun-2015 guenther

Force the return to userspace from execve to go through iretq to get all
registers. This lets us kill the special handling of pid 1 in fork and
merge {proc,child}_trampoline(). Do the same if ptrace(PT_SETREGS) is used
to modify registers.

ok mlarkin@ kettenis@


# 1.67 28-Jun-2015 guenther

Split AST handling from trap() into ast() and get rid of T_ASTFLT.
Don't skip the AST check when returning from *fork() in the child.
Make sure to count interrupts even when they're deferred or stray.

testing by krw@, and then many via snapshots


# 1.66 23-Jun-2015 bluhm

If the kernel symbols fit completely into the 2 MB alignment hole
after kernel bss but before end of the image, the page tables used
the read-only mapping of the hole. When booting a small non-generic
kernel, this resulted in a crash, while writing to the page tables
later.
Make sure that the page tables are created after esym and after
end.
OK mlarkin@ deraadt@


# 1.65 18-May-2015 guenther

Do lazy update/reset of the FS.base and %[def]s segment registers: reseting
segment registers in cpu_switchto if the old thread had made it to userspace
and restoring FS.base only on first return to userspace since context switch.

ok mlarkin@


# 1.64 18-Apr-2015 guenther

i386 and amd64 have only one syscall entry point now, so simplify the
EIP/RIP adjustment for ERESTART

ok mlarkin@


# 1.63 22-Mar-2015 guenther

Explain the state on syscall entry


Revision tags: OPENBSD_5_7_BASE
# 1.62 16-Jan-2015 sf

Binary code patching on amd64

This commit adds generic infrastructure to do binary code patching on amd64.
The existing code patching for SMAP is converted to the new infrastructure.

More consumers and support for i386 will follow later.

This version of the diff has some simplifications in codepatch_fill_nop()
compared to a version that was:

OK @kettenis @mlarkin @jsg


# 1.61 21-Dec-2014 mlarkin

Prevent writing to the kernel area via the direct map. We do this by padding
the end of the kernel area to 2MB, so that the direct map pages can then
have the W permission removed (X permission was already removed in a previous
diff). This creates a VA hole at the end of bss, so adjust for that since
that's where symbols get loaded by the bootloader (for now, map that region
RO until the boot loader can be updated to place the symbols at "end" instead
of "end of bss").

with help from and ok deraadt@
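
The padding arithmetic described above can be sketched in C. This is only
an illustration, not the locore.S code: LARGE_PAGE_SIZE, round_up_2mb() and
padding_hole() are hypothetical names, and the 2MB value is taken from the
commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed large-page size: 2MB, as named in the commit message. */
#define LARGE_PAGE_SIZE	(2ULL * 1024 * 1024)

/* Hypothetical helper: round addr up to the next 2MB boundary. */
static uint64_t
round_up_2mb(uint64_t addr)
{
	/* Refuse inputs whose rounded value would overflow 64 bits. */
	assert(addr <= UINT64_MAX - (LARGE_PAGE_SIZE - 1));
	return (addr + LARGE_PAGE_SIZE - 1) & ~(LARGE_PAGE_SIZE - 1);
}

/* Hypothetical helper: size of the hole between the end and the padded end. */
static uint64_t
padding_hole(uint64_t end)
{
	return round_up_2mb(end) - end;
}
```

An address that is already 2MB-aligned is left unchanged, so the hole is
zero in that case.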


# 1.60 27-Nov-2014 mlarkin

Missing comparison caused NX to always be enabled during boot, even on CPUs
that may have had it disabled in BIOS.

ok deraadt@


# 1.59 20-Nov-2014 mlarkin

When removing the identity mapping in low memory used during bootstrap,
there is no reason to keep the NX bit around on null PTEs (PTEs that have
been removed).


# 1.58 20-Nov-2014 mlarkin

Move previous PTE permission fixup code into locore, and fixup some more
ranges while we're there.

ok deraadt@, tested by many and in snaps


# 1.57 07-Nov-2014 mlarkin

Wrong comment - NX is handled later (for now), not in locore. No functional
change.

noticed by deraadt@


# 1.56 05-Nov-2014 mlarkin

Map .rodata RO after boot on amd64. Makefile.amd64 changes from deraadt.

ok deraadt@


# 1.55 09-Oct-2014 tedu

no need for lkm_map now


Revision tags: OPENBSD_5_3_BASE OPENBSD_5_4_BASE OPENBSD_5_5_BASE OPENBSD_5_6_BASE
# 1.54 10-Nov-2012 mglocker

Recent x86 CPUs come with a constant time stamp counter. If this is
the case we verify if the CPU supports a specific version of the
architectural performance monitoring feature and read out the current
frequency from the fixed-function performance counter of the unhalted
core.

My initial motivation to implement this was the Soekris net6501-70
which comes with an Intel Atom E6xx 1.60GHz CPU. It has a constant
time stamp counter plus speed step support and boots on the lowest
frequency of 600MHz. This caused hw.cpuspeed and hw.setperf to
reflect the wrong values.

The diff is a cooperation work with jsg@. The fixed-function
performance counter read code comes from a former diff of him.

OK jsg@


# 1.53 25-Sep-2012 pirofti

Remove unused acpi locking code.

To be replaced with higher level C routines once we settle for a common
consistent set of atomic operations across platforms.

Discussed with and okay by deraadt@ and kettenis@.


Revision tags: OPENBSD_5_2_BASE
# 1.52 06-May-2012 guenther

Garbage collect the old int$80 kernel entry point: the last use of
it by the not-normally-used sigreturn() stub in libc was changed to
use 'syscall' instruction in 5.0

ok mikeb@ jsg@


Revision tags: OPENBSD_5_1_BASE
# 1.51 26-Dec-2011 haesbaert

Add the missing ECX cpu flags from CPUID at 0x80000001.
This is all documented at:

http://support.amd.com/us/Embedded_TechDocs/25481.pdf (page 20)
http://www.intel.com/assets/pdf/appnote/241618.pdf (page 41)

ok jsg@


# 1.50 12-Oct-2011 miod

Remove all MD diagnostics in cpu_switchto(), and move them to MI code if
they apply.

ok oga@ deraadt@


# 1.49 03-Sep-2011 guenther

Add a general warning about gdb matching against sigcode instructions


Revision tags: OPENBSD_5_0_BASE
# 1.48 04-Jul-2011 guenther

Force the sigreturn syscall to return to userspace via iretq by setting
the MDP_IRET flag in md_proc, then switch sigcode to enter the kernel
via syscall instead of int$80. Rearrange the return paths in both the
sysretq and iretq paths to reduce how long interrupts are blocked and
shave instructions.

ok kettenis@, extra testing krw@


# 1.47 13-Apr-2011 guenther

Unrevert the FS.base diff: the issues were actually elsewhere
Additional testing by jasper@ and pea@


# 1.46 10-Apr-2011 guenther

Revert bulk of the FS.base diff, as it causes issues on some machines
and the problem isn't obvious yet.


# 1.45 05-Apr-2011 guenther

Add support for per-rthread base-offset for the %fs selector on amd64.
Add pcb_fsbase to the PCB for tracking what the value for the thread
is, and ci_cur_fsbase to struct cpu_info for tracking the CPU's current
value for FS.base, then on return to user-space, skip the setting if the
CPU has the right value already. Non-threaded processes without TLS leave
FS.base zero, which can be conveniently optimized: setting %fs zeros
FS.base for fewer cycles than wrmsr.

ok kettenis@
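
A rough C model of the "skip the setting if the CPU has the right value
already" idea, for illustration only. The field names pcb_fsbase and
ci_cur_fsbase come from the commit message; the struct layouts, the
wrmsr_count counter, fake_wrmsr_fsbase() and return_to_user() are
assumptions made for this sketch, and the real code is assembly that
issues a privileged wrmsr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real structures (assumed layouts). */
struct pcb {
	uint64_t pcb_fsbase;		/* FS.base wanted by this thread */
};

struct cpu_info {
	uint64_t ci_cur_fsbase;		/* FS.base currently loaded on the CPU */
	unsigned wrmsr_count;		/* sketch only: counts simulated writes */
};

/* Stand-in for the privileged wrmsr; it only records the write. */
static void
fake_wrmsr_fsbase(struct cpu_info *ci, uint64_t base)
{
	ci->ci_cur_fsbase = base;
	ci->wrmsr_count++;
}

/* On return to userspace, write FS.base only if the CPU's value differs. */
static void
return_to_user(struct cpu_info *ci, const struct pcb *pcb)
{
	assert(ci != NULL && pcb != NULL);
	if (ci->ci_cur_fsbase != pcb->pcb_fsbase)
		fake_wrmsr_fsbase(ci, pcb->pcb_fsbase);
}
```

The sketch leaves out the zero-base shortcut (loading %fs instead of
using wrmsr) mentioned in the commit message.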


Revision tags: OPENBSD_4_9_BASE
# 1.44 04-Dec-2010 guenther

The pm_cpus member of the pmap is now a 64bit integer: update the assembly
used in cpu_switch() for handling it. Also, delete an unnecessary
instruction that I added while debugging the pm_cpus handling before

ok kettenis@
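
For illustration, a 64-bit per-pmap CPU bitmask can be handled in C as
below. This is a sketch under the assumption that bit N stands for CPU N;
the helper names are hypothetical and the real handling is done in
assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mark the pmap as active on the given CPU. */
static void
pmap_mark_active(uint64_t *pm_cpus, unsigned cpu)
{
	assert(cpu < 64);
	/* 1ULL keeps the shift 64 bits wide; a plain 1 would be an int. */
	*pm_cpus |= 1ULL << cpu;
}

/* Hypothetical helper: mark the pmap as no longer active on the CPU. */
static void
pmap_mark_inactive(uint64_t *pm_cpus, unsigned cpu)
{
	assert(cpu < 64);
	*pm_cpus &= ~(1ULL << cpu);
}

/* Hypothetical helper: nonzero if the pmap is marked active on the CPU. */
static int
pmap_is_active(uint64_t pm_cpus, unsigned cpu)
{
	assert(cpu < 64);
	return (pm_cpus >> cpu) & 1;
}
```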


# 1.43 13-Nov-2010 guenther

Switch from TSS-per-process to TSS-per-CPU, placing the TSS right
next to the cpu's GDT, also making the double-fault stack per-CPU,
leaving it at the top of the page of the CPU's idle process. Inline
pmap_activate() and pmap_deactivate() into the asm cpu_switchto
routine, adding a check for the new pmap already being marked as
active on the CPU. Garbage collect the hasn't-been-used-in-years
GDT update IPI.

Tested by many; ok mikeb@, kettenis@


# 1.42 26-Oct-2010 guenther

The LDT is only used by dead compat code now, so load the ldt
register with the null selector (disabling use of it), stop reloading
it on every context switch, and blow away the table itself, as well
as the pcb and pmap bits that were used to track it. Also, delete
two other unused pcb members: pcb_usersp and pcb_flags. (Deleting
pcb_usersp also keeps the pcb_savefpu member aligned properly.)
Finally, delete the defines for the unimplemented AMD64_{GET,SET}_LDT
sysarch() calls.

Tested by various with both AMD and Intel chips
ok mikeb@


# 1.41 14-Oct-2010 guenther

Clean up segment handling: switch user-space to using code and data
segments in the GDT instead of the LDT and eliminate the GDT slots
that we don't actually use.

tested on both amd and intel by several
not really the right person, but ok: kettenis@


# 1.40 28-Sep-2010 guenther

Correct the handling of GS.base when iretq faults: the fault happens
with CPL == 0 but the user's GS.base, so the normal INTRENTRY handling
won't work. Contrariwise, the asm that trap() redirects us to when that
happens (resume_iret) sees a trapframe showing CPL==3 but it's run with
the kernel's GS.base, so INTRENTRY won't work there either.

asm style fixes drahn@ and mikeb@
ok kettenis@


Revision tags: OPENBSD_4_6_BASE OPENBSD_4_7_BASE OPENBSD_4_8_BASE
# 1.39 09-Jun-2009 krw

revert guenther@'s un-revert of art's curpmap.

My

bios0: ASUSTeK Computer INC. P5K-E
cpu0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.74 MHz
cpu1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz
cpu3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 2405.46 MHz

can't boot with this in. It always hangs somewhere in fsck'ing if
any, or between netstart and local daemons if no fsck'ing. Also
fubars theo's real amd machine.

Much more testing needed for this.


# 1.38 06-Jun-2009 guenther

Unrevert the curpmap change with the addition of correct %gs handling
in the IPI handler so that it works when it interrupts userspace,
waiting for the droppmap IPI to complete when destroying it, and
(most importantly) don't call pmap_tlb_droppmap() from cpu_exit().
Tested by myself and ckuethe, as our machines choked on the original.

ok @art


# 1.37 05-Jun-2009 guenther

Revert the curpmap change. We know the IPI is broken on both ends,
but even with proposed fixes, the reaper panics are back.


# 1.36 02-Jun-2009 jordan

Added interface for cpu idle on amd64
ok gwk@, toby@, marco@


# 1.35 28-May-2009 art

Bring back the curpmap change. It was missing a reload of the pmap on
curcpu when we were freeing a pmap. Tested and working for a few weeks
now, but I was a bit too busy to commit it earlier.


# 1.34 27-Apr-2009 deraadt

turning pmap_deactivate into a NOP brought back the reaper panics, probably
because the reaper is running on the mappings of pmap from the process it
is about to unmap. back it out until it is fixed right; don't let this sit
in the tree waiting for a fix.


# 1.33 23-Apr-2009 art

Make pmap_deactivate a NOP.

Instead of keeping a bitmask of on which cpu the pmap might be active which
we clear in pmap_deactivate, always keep a pointer to the currently loaded
pmap in cpu_info. We can now optimize a context switch to the kernel pmap
(idle and kernel threads) to keep the previously loaded pmap still loaded
and then reuse that pmap if we context switch back to the same process.

Introduce a new IPI to force a pmap reload before the pmap is destroyed.

Clean up cpu_switchto.

toby@ ok


# 1.32 31-Mar-2009 art

- remove obsolete comment
- remove dead (#if 0) code
- move switch_error panics to after cpu_switchto to make branch prediction
happier and the code more readable.

no functional change


Revision tags: OPENBSD_4_5_BASE
# 1.31 15-Feb-2009 mikeb

Set the limit of the GDT table to its size - 1.

Reported by and diff from Remco <remco at d-compu.dyndns.org>, thanks!
Checked with kettenis@.

ok kettenis
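
The "size - 1" rule follows from the x86 descriptor-table limit being the
offset of the last valid byte rather than a byte count. A small C sketch,
where table_limit() is a hypothetical helper and not a kernel function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: limit field for a descriptor table of size bytes. */
static uint16_t
table_limit(size_t size)
{
	/* The limit field is 16 bits wide, so the table is at most 64KB. */
	assert(size >= 1 && size <= 65536);
	/* Offset of the last valid byte, hence size - 1. */
	return (uint16_t)(size - 1);
}
```

For example, a table of eight 8-byte descriptors occupies 64 bytes and
therefore has a limit of 63.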


# 1.30 12-Nov-2008 weingart

Add a comment to sigcode() to explain why the use of 'int $0x80' is
necessary, so that future hackers will not be misled the same way I
was when looking at this code.


# 1.29 24-Oct-2008 deraadt

remove unused label


# 1.28 13-Aug-2008 weingart

This tab had bugged me forever.


Revision tags: OPENBSD_4_4_BASE
# 1.27 28-Jul-2008 miod

No longer clear ci_want_resched within cpu_switchto(), now that it's done
in the MI code.


# 1.26 27-Jun-2008 ray

More removal of clauses 3 and 4 from NetBSD licenses.

OK deraadt@ and millert@


Revision tags: OPENBSD_4_3_BASE
# 1.25 03-Nov-2007 gwk

Add acpi_acquire_global_lock(), and acpi_release_global_lock to
amd64 the not ghetto architecture.

ok toby@


# 1.24 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


# 1.23 12-Sep-2007 deraadt

port of i386 pctr code to amd64; Mike Belopuhov


Revision tags: OPENBSD_4_2_BASE
# 1.22 27-May-2007 art

- Redo the way we set up the direct map. Map the first 4GB of it
in locore so that we can use the direct map in pmap_bootstrap when
setting up the initial page tables.

- Introduce a second direct map (I love large address spaces) with
uncached pages.

jason@ ok


Revision tags: OPENBSD_3_8_BASE OPENBSD_3_9_BASE OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.21 20-Aug-2005 jsg

Check for and report the presence of SSE3. This has started to appear
in AMD products with the arrival of the venice core.
ok deraadt@


# 1.20 26-Jul-2005 art

Instead of juggling around with cr4 and enabling parts of it sometimes,
other parts later, etc. Just set it to the same default value everywhere.
We won't survive without PSE and it's not like someone will suddenly make
an amd64 that doesn't support PGE.

This will allow us to make the bootstrap process slightly more sane.


# 1.19 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.18 27-May-2005 art

Stop pretending that amd64 is i386. We're insulting the cpu by not even
pretending to use all the address space it gives us.

- Map all physical memory 1-1 and implement PMAP_DIRECT
- Remove the vast magic we do to map pages for pmap_zero_page,
pmap_copy_page, pv allocation, magic while bootstrapping,
reading of /dev/mem, etc.
- implement a fast pmap_zero_page based on sse instructions.

I love removing code. More to come.

deraadt@ ok tested by many.


# 1.17 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_7_BASE
# 1.16 06-Jan-2005 martin

missing $OpenBSD$


# 1.15 01-Jan-2005 millert

gcc 3.3.5 will store zero-initialized variables in bss by default,
move bootdev to data so it doesn't get zapped when bss is cleared.
deraadt@ OK


Revision tags: OPENBSD_3_6_BASE
# 1.14 25-Jun-2004 art

SMP support. Big parts from NetBSD, but with some really serious debugging
done by me, niklas and others. Especially wrt. NXE support.

Still needs some polishing, especially in dmesg messages, but we're now
building kernel faster than ever.


# 1.13 22-Jun-2004 art

Switch amd64 to __HAVE_CPUINFO

deraadt@ ok


# 1.12 21-Jun-2004 niklas

Pure luck has protected us from this bug until now: in locore.S,
%r9 is not saved over function calls.
Moreover, we did not even want &proc0 as the old process in switch_search, but zero. Fixes bsd.rd.


# 1.11 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.10 13-May-2004 sturm

activate systrace on amd64, while here get rid of syscall_{plain,fancy}
instead use syscall() as everywhere else

ok mickey, tested and ok tedu@


Revision tags: OPENBSD_3_5_BASE
# 1.9 25-Feb-2004 deraadt

dkcsum stuff for amd64, written by tom, who cannot commit it at the moment.
now the amd64 knows what drive it was booted from.


# 1.8 23-Feb-2004 mickey

the consdev pass from boot and a cnset() in the kernel seems to be a bit premature and cause problems


# 1.7 23-Feb-2004 mickey

get use of NX; partially from netbsd; passes the regress; deraadt@ ok


# 1.6 23-Feb-2004 tom

- Pick up the /boot argc, argv in locore.S (though not currently used)
- Probe for console devices (incl serial) in /boot
- Pass console device from /boot to kernel (temp via additional param)

With this, boot> set tty com0 now works.

"just don't break a build" deraadt@


# 1.5 22-Feb-2004 tom

- Make comment about parameters passed by /boot reflect reality
- Don't use _C_LABEL() on a parameter given to RELOC(), since RELOC()
does this itself

ok mickey@


# 1.4 20-Feb-2004 deraadt

use an old syscall (int $0x80) for the sigreturn; otherwise %rcx is trashed.
we've been chasing this for 2 weeks.. finally spotted by kettenis@chello.nl


# 1.3 07-Feb-2004 miod

branches: 1.3.2;
Be sure to flag pte constants as UL, and cope with this in locore.
ok deraadt@


# 1.2 03-Feb-2004 mickey

das boot; das cloned das from das i386


# 1.1 28-Jan-2004 mickey

an amd64 arch support.
hacked by art@ from netbsd sources and then later debugged
by me into the shape where it can host itself.
no bootloader yet as needs redoing from the
recent advanced i386 sources (anyone? ;)