History log of /freebsd-current/sys/amd64/amd64/exception.S
Revision Date Author Comments
# 8551c31b 11-Apr-2024 Elyes Haouas <ehaouas@noos.fr>

exception: Fix typos

Signed-off-by: Elyes Haouas <ehaouas@noos.fr>
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/885


# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# 7aa47cac 25-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

amd64: remove lfence after swapgs on syscall entry

According to the description of SBSS issue at
https://software.intel.com/content/www/us/en/develop/articles/software-security-guidance/technical-documentation/speculative-behavior-swapgs-and-segment-registers.html
lfence after swapgs is needed only for the case when swapgs could be
speculatively executed. Since syscall entry, unlike exception and
interrupt entries, executes swapgs unconditionally, there is no
opportunity for speculation.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31682


# b0f71f1b 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

amd64: Add MD bits for KMSAN

Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code. This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls. Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.

Use these functions in amd64 interrupt and exception handlers. Note
that handlers for user->kernel transitions need not be annotated.

Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31467


# aa3ea612 31-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

x86: remove gcov kernel support

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D29529


# 098dbd7f 26-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

amd64 nmi handler: fix comment about %ebx

Reported by: markj
MFC after: 3 days


# d39f7430 25-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

amd64: preserve %cr2 in NMI/MCE/DBG handlers.

These handlers could interrupt code which has interrupts disabled,
and if a spurious page fault occurs during exception handler run,
we get clobbered %cr2 in higher level stack.

This is mostly a speculation, but it is based on hints from good sources.

MFC after: 1 week
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27772


# 0675f4e1 19-Jul-2020 Konstantin Belousov <kib@FreeBSD.org>

Simplify non-pti syscall entry on amd64.

Limit manipulations to use %rax as scratch to the pti portion of the
syscall entry code.

Submitted by: alc
Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25722


# 3ec7e169 18-Jul-2020 Konstantin Belousov <kib@FreeBSD.org>

amd64 pmap: microoptimize local shootdowns for PCID PTI configurations

When pmap operates in PTI mode, we must reload %cr3 on return to
userspace. In non-PCID mode the reload always flushes all non-global
TLB entries and we take advantage of it by only invalidating the KPT
TLB entries (there is no cached UPT entries at all).

In PCID mode, we flush both KPT and UPT TLB explicitly, but we can
take advantage of the fact that PCID mode command to reload %cr3
includes a flag to flush/not flush target TLB. In particular, we can
avoid the flush for UPT, instead record that load of pc_ucr3 into %cr3
on return to usermode should be flushing. This is done by providing
either all-1s or ~CR3_PCID_MASK in pc_ucr3_load_mask. The mask is
automatically reset to all-1s on return to usermode.

Similarly, we can avoid flushing UPT TLB on context switch, replacing
it by setting pc_ucr3_load_mask. This unifies INVPCID and non-INVPCID
PTI ifunc, leaving only 4 cases instead of 6. This trick is also
applicable both to the TLB shootdown IPI handlers, since handlers
interrupt the target thread.

But then we need to check pc_curpmap in handlers, and this would
reopen the same race for INVPCID machines as was fixed in r306350 for
non-INVPCID. To not introduce the same bug, unconditionally do
spinlock_enter() in pmap_activate().

Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D25483


# da248a69 20-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: in double fault handler, do not rely on sane gsbase value.

Typical reasons for doublefault faults are either kernel stack
overflow or bugs in the code that manipulates protection CPU state.
The later code is the code which often has to set up gsbase for
kernel. Switching to explicit load of GSBASE MSR in the fault handler
makes it more probable to output a useful information.

Now all IST handlers have nmi_pcpu structure on top of their stacks.

It would be even more useful to save gsbase value at the moment of the
fault. I did not this because I do not want to modify PCB layout now.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c4f056e8 13-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: only set PCB_FULL_IRET pcb flag when #gp or similar exception comes
from usermode.

If CPU supports RDFSBASE, the flag also means that userspace fsbase
and gsbase are already written into pcb, which might be not true when
we handle #gp from kernel.

The offender is rdmsr_safe(), and the visible result is corrupted
userspace TLS base.

Reported by: pstef
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 83ba1468 03-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: Store %cr3 into pcpu saved_ucr3 on double fault.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 90e35b0a 06-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: prevents speculations over swapgs reload of %gs base.

Such speculations could use user-controlled %gs base, esp. since
FreeBSD supports WRGSBASE instructions.

Place LFENCEs on entry for each basic block after the test for
previous kernel/user mode on the kernel entry, which prevents the
speculation. Code accesses %gs-based PCPU before any serialization
instructions are executed, like %cr3 reload for KPTI.

With pti disabled, on haswell i7-4770S machine, "syscall_timings getppid"
shows when no lfence is added to syscall path:
test loop time iterations periteration
getppid 0 1.040918865 4643611 0.000000224
getppid 1 1.004985962 4481816 0.000000224
getppid 2 1.005196483 4482363 0.000000224
with lfence:
getppid 0 1.043701091 4554779 0.000000229
getppid 1 1.016930328 4438094 0.000000229
getppid 2 1.023223117 4466640 0.000000229
and ministat reports 'No difference proven at 95.0% confidence.'

Security: CVE-2019-1125
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 1947b298 03-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: Streamline exceptions and interrupts handlers.

PTI-mode entry points were coded to set up the environment identical
to non-PTI entry and then fall-through to non-PTI handlers, mostly.
This has the drawback of requiring two more SWAPGS, first to access
PCPU, and then to return to the state expected by the non-PTI entry
point.

Eliminate the duplication by doing more in entry stubs both for PTI
and non-PTI, and adjusting the common code to expect that SWAPGS and
some minimal registers saving is done by entries.

Some less often used entries, in particular, #GP, #NP, and #SS, which
can fault on doreti, are left as is because there are basically four
variants of entrance, and they are not performance-critical,
esp. comparing with e.g. #PF or interrupts.

Reviewed by: markj (previous version)
Tested by: pho (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 7355a02b 14-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Mitigations for Microarchitectural Data Sampling.

Microarchitectural buffers on some Intel processors utilizing
speculative execution may allow a local process to obtain a memory
disclosure. An attacker may be able to read secret data from the
kernel or from a process when executing untrusted code (for example,
in a web browser).

Reference: https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00233.html
Security: CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091
Security: FreeBSD-SA-19:07.mds
Reviewed by: jhb
Tested by: emaste, lwhsu
Approved by: so (gtetlow)


# 762138f7 05-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: clear callee-preserved registers on syscall exit.

%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.

Reviewed by: markj
Approved by: so
Security: CVE-2019-5595
Sponsored by: The FreeBSD Foundation
MFC after: 0 minutes


# c1141fba 19-Aug-2018 Konstantin Belousov <kib@FreeBSD.org>

Update L1TF workaround to sustain L1D pollution from NMI.

Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D. If NMI occurs after either of
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers. There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.

Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded. If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.

Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D16790


# b3a7db3b 29-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

Use SMAP on amd64.

Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access. Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838


# dbe30617 01-Jun-2018 Bruce Evans <bde@FreeBSD.org>

Fix recent breakages of kernel profiling, mostly on i386 (high resolution
kernel profiling remains broken).

memmove() was broken using ALTENTRY(). ALTENTRY() is only different from
ENTRY() in the profiling case, and its use in that case was sort of
backwards. The backwardness magically turned memmove() into memcpy()
instead of completely breaking it. Only the high resolution parts of
profiling itself were broken. Use ordinary ENTRY() for memmove().
Turn bcopy() into a tail call to memmove() to reduce complications.
This gives slightly different pessimizations and profiling lossage.
The pessimizations are minimized by not using a frame pointer() for
bcopy().

Calls to profiling functions from exception trampolines were not
relocated. This caused crashes on the first exception. Fix this using
function pointers.

Addresses of exception handlers in trampolines were not relocated. This
caused unknown offsets in the profiling data. Relocate by abusing
setidt_disp as for pmc although this is slower than necessary and
requires namespace pollution. pmc seems to be missing some relocations.
Stack traces and lots of other things in debuggers need similar relocations.

Most user addresses were misclassified as unknown kernel addresses and
then ignored. Treat all unknown addresses as user. Now only user
addresses in the kernel text range are significantly misclassified (as
known kernel addresses).

The ibrs functions didn't preserve enough registers. This is the only
recent breakage on amd64. Although these functions are written in
asm, in the profiling case they call profiling functions which are
mostly for the C ABI, so they only have to save call-used registers.
They also have to save arg and return registers in some cases and
actually save them in all cases to reduce complications. They end up
saving all registers except %ecx on i386 and %r10 and %r11 on amd64.
Saving these is only needed for 1 caller on each of amd64 and i386.
Save them there. This is slightly simpler.

Remove saving %ecx in handle_ibrs_exit on i386. Both handle_ibrs_entry
and handle_ibrs_exit use %ecx, but only the latter needed to or did
save it. But saving it there doesn't work for the profiling case.

amd64 has more automatic saving of the most common scratch registers
%rax, %rcx and %rdx (its complications for %r10 are from unusual use
of %r10 by SYSCALL). Thus profiling of handle_ibrs_exit_rs() was not
broken, and I didn't simplify the saving by moving the saving of these
registers from it to the caller.


# 053641bb 08-May-2018 Konstantin Belousov <kib@FreeBSD.org>

Prepare DB# handler for deferred trigger of watchpoints.

Since pop %ss/mov %ss instructions defer all interrupts and exceptions
for the next instruction, it is possible that the userspace watchpoint
trap executes on the first instruction of the kernel entry for
syscall/bpt.

In this case, DB# should be treated similarly to NMI: on amd64 we must
always load GSBASE even if the trap comes from kernel mode, and load
the kernel page table root into %cr3. Moreover, the trap must
use the dedicated stack, because we are still on the user stack when
trapped on syscall entry.

For i386, we must reload %cr3. The syscall instruction is not configured,
so there is no issue with executing on user stack when trapping.

Due to some CPU erratas it is not always possible to detect that the
userspace watchpoint triggered by inspecting %dr6. In trap(), compare the
trap %rip with the known unsafe entry points and if matched pretend that
the watchpoint did not fire at all.

Thank you to the MSRC Incident Response Team, and in particular Greg
Lenti and Nate Warfield, for coordinating the response to this issue
across multiple vendors.

Thanks to Computer Recycling at The Working Center of Kitchener for
making hardware available to allow us to test the patch on additional
CPU families.

Reviewed by: jhb
Discussed with: Matthew Dillon
Tested by: emaste
Sponsored by: The FreeBSD Foundation
Security: CVE-2018-8897
Security: FreeBSD-SA-18:06.debugreg


# 27275f8a 26-Apr-2018 Tycho Nightingale <tychon@FreeBSD.org>

Expand the checks for UCR3 == PMAP_NO_CR3 to enable processes to be
excluded from PTI.

Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15100


# 19c5cea3 25-Apr-2018 Tycho Nightingale <tychon@FreeBSD.org>

If a trap is encountered upon executing iretq from within doreti() the
hardware will ensure the stack pointer is aligned to a 16-byte
boundary before saving the fault state on the stack.

In the PTI case, handle this potential alignment adjustment by copying
both frames independently while unwinding the stack in between.

Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15183


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# fc2a8776 20-Mar-2018 Ed Maste <emaste@FreeBSD.org>

Rename assym.s to assym.inc

assym is only to be included by other .s files, and should never
actually be assembled by itself.

Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180


# 319117fd 31-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

IBRS support, AKA Spectre hardware mitigation.

It is coded according to the Intel document 336996-001, reading of the
patches posted on lkml, and some additional consultations with Intel.

For existing processors, you need a microcode update which adds IBRS
CPU features, and to manually enable it by setting the tunable/sysctl
hw.ibrs_disable to 0. Current status can be checked in sysctl
hw.ibrs_active. The mitigation might be inactive if the CPU feature
is not patched in, or if CPU reports that IBRS use is not required, by
IA32_ARCH_CAP_IBRS_ALL bit.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D14029


# b4dfc9d7 19-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

PTI: Trap if we returned to userspace with kernel (full) page table
still active.

Map userspace portion of VA in the PTI kernel-mode page table as
non-executable. This way, if we ever miss reloading ucr3 into %cr3 on
the return to usermode, the process traps instead of executing in
potentially vulnerable setup. Catch the condition of such trap and
verify user-mode %cr3, which is saved by page fault handler.

I peek this trick in some article about Linux implementation.

Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
DIfferential revision: https://reviews.freebsd.org/D13956


# 68fd3b0e 18-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Use a dedicated per-CPU stack for machine check exceptions.

Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF. This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs. Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3. Store the CPU's pcpu pointer at the stop of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).

This should also fix a similar issue with PTI with a MC# occurring
while the CPU is executing on the trampoline stack.

While here, bypass trap() entirely and just call mca_intr(). This
avoids a bogus call to kdb_reenter() (there's no reason to try to
reenter kdb if a MC# is raised).

Reviewed by: kib
Tested by: avg (on AMD without PTI)
Differential Revision: https://reviews.freebsd.org/D13962


# f36b1fe0 18-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Remove two no-longer-used labels from the NMI interrupt handler.

Reviewed by: kib


# 7f513d17 18-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Adjust branch target in NMI handler for the !PTI case.

In the !PTI case the NMI handler jumped past the instructions that set
%rdi to point to the current PCB, but the target instructions assumed %rdi
were set.

Reviewed by: kib
Tested by: pho


# 9cb26f73 17-Jan-2018 Mark Johnston <markj@FreeBSD.org>

Annotate a couple of changes from r328083.

Reviewed by: kib
X-MFC with: r328083


# bd50262f 17-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

PTI for amd64.

The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.

The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:

kernel text (but not modules text)
PCPU
GDT/IDT/user LDT/task structures
IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords. Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.

IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame. Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use. In principle, they can be installed into kernel page table per
pmap with some work. Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.

PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation. I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal. The fix is pending.

Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 4975c202 10-Jan-2018 Konstantin Belousov <kib@FreeBSD.org>

Do not clear %RFLAGS.DF on fast syscall entry.

Hardware already did it for us due to the mask loaded into the
MSR_SF_MASK msr register.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13838


# 0d4e7ec5 26-Aug-2017 Ryan Libby <rlibby@FreeBSD.org>

amd64: drop q suffix from rd[fg]sbase for gas compatibility

Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12133


# 3e902b3d 21-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Make WRFSBASE and WRGSBASE instructions functional.

Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.

Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.

Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.

The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.

Reviewed by: jhb (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12023


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# edafb5a3 03-May-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/amd64: Small spelling fixes.

No functional change.


# 44784411 13-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Expose doreti as a global symbol on amd64 and i386.

doreti provides the common code path for returning from interrupt
andlers on x86. Exposing doreti as a global symbol allows kernel
modules to include low-level interrupt handlers instead of requiring
all low-level handlers to be statically compiled into the kernel.

Submitted by: Howard Su <howard0su@gmail.com>
Reviewed by: kib


# 9054bcbc 12-Apr-2016 Andriy Gapon <avg@FreeBSD.org>

[amd64] dtrace_invop handler is to be called only for kernel exceptions

DTrace-related exceptions in userland code are handled elsewhere.
One practical problem was a crash in dtrace_invop_start() when saved
%rsp pointed to a virtual address that was not backed.

i386 code already ignored userland exceptions.

Reviewed by: markj, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D5906


# 21be12e0 27-Aug-2015 Mark Johnston <markj@FreeBSD.org>

Remove an unneeded instruction.

MFC after: 1 week


# 2d45c2d5 16-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

The iret instruction may generate #np and #ss fault, besides #gp.
When returning to usermode, the handler for that exceptions is also
executed with wrong gs base. Handle all three possible faults in the
same way, checking for iret fault, and performing full iret.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 5a5f9d21 18-Jul-2014 Mark Johnston <markj@FreeBSD.org>

Use a C wrapper for trap() instead of checking and calling the DTrace trap
hook in assembly.

Suggested by: kib
Reviewed by: kib (original version)
X-MFC-With: r268600


# 291624fd 13-Jul-2014 Mark Johnston <markj@FreeBSD.org>

Invoke the DTrace trap handler before calling trap() on amd64. This matches
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.

This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.

Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95


# 64e97265 29-May-2014 Konstantin Belousov <kib@FreeBSD.org>

When usermode loaded non-default segment selector into the %gs,
correctly prepare KGSBASE msr to restore the user descriptor base on
the last swapgs during return to usermode.

Reported and tested by: peterj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# c788f925 18-Jun-2013 Konstantin Belousov <kib@FreeBSD.org>

Some clarifications and updates for the comments, mostly retrieved
from Bruce Evans. Trim the trailing spaces.

MFC after: 1 week


# 6806ce6e 27-May-2013 Konstantin Belousov <kib@FreeBSD.org>

When handling an exception from the attempt from loading the faulting
context on return from the trap handler, re-enable the interrupts on
i386 and amd64. The trap return path have to disable interrupts since
the sequence of loading the machine state is not atomic. The trap()
function which transfers the control to the special handler would
enable the interrupt, but an iret loads the previous eflags with PSL_I
clear. Then, the special handler calls trap() on its own, which now
sees the original eflags with PSL_I set and does not enable
interrupts.

The end result is that signal delivery and process exiting code could
be executed with interrupts disabled, which is generally wrong and
triggers several assertions.

For amd64, the interrupts are enabled conditionally based on PSL_I in
the eflags of the outer frame, as it is already done for
doreti_iret_fault. For i386, the interrupts are enabled
unconditionally, the ast loop could have opened a window with
interrupts enabled just before the iret anyway.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 7a1c55c3 15-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Microoptimize the return path for the fast syscalls on amd64. Arrange
the code to have the fall-through path to follow the likely target.
Do not use intermediate register to reload user %rsp.

Proposed by: alc
Reviewed by: alc, jhb
Approved by: re (bz)
MFC after: 2 weeks


# e4505da6 11-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

The jump target shall be after the padding, not into it.

Reported by: alc
Approved by: re (bz)
MFC after: 2 weeks


# 8bd1142b 11-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Perform amd64-specific microoptimizations for native syscall entry
sequence. The effect is ~1% on the microbenchmark.

In particular, do not restore registers which are preserved by the
C calling sequence. Align the jump target. Avoid unneeded memory
accesses by calculating some data in syscall entry trampoline.

Reviewed by: jhb
Approved by: re (bz)
MFC after: 2 weeks


# 5ab73cbc 08-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

Disable local interrupts before testing the PCB_FULL_IRET flag.
Thread might be preempted after testing, which causes the flag to be
cleared. If ast was not delivered, we will do sysret with potentially
wrong fs/gs bases.

Reviewed by: jhb, jkim
MFC after: 1 week (together with r220430, r220452)


# 13fb631a 08-Apr-2011 John Baldwin <jhb@FreeBSD.org>

Fix a bug in the previous change to restore the fast path for syscall
return. The ast() function may cause a context switch in which case
PCB_FULL_IRET would be set in the pcb. However, the code was not
rechecking the flag after ast() returned and would not properly restore
the FSBASE and GSBASE MSRs. To fix, recheck the PCB_FULL_IRET flag after
ast() returns.

While here, trim an instruction (and memory access) from the doreti path
and fix a typo in a comment.

MFC after: 1 week


# 615d2dff 07-Apr-2011 John Baldwin <jhb@FreeBSD.org>

pcb_flags is an int, so use testl rather than testq.

Pointy hat to: jhb
Submitted by: jkim
MFC after: 1 week


# 1438b4ce 07-Apr-2011 John Baldwin <jhb@FreeBSD.org>

If a system call does not request a full interrupt return, use a fast
path via the sysretq instruction to return from the system call. This was
removed in 190620 and not quite fully restored in 195486. This resolves
most of the performance regression in system call microbenchmarks between
7 and 8 on amd64.

Reviewed by: kib
MFC after: 1 week


# 50e3cec3 22-Dec-2010 Jung-uk Kim <jkim@FreeBSD.org>

Increase size of pcb_flags to four bytes.

Requested by: bde, jhb


# e6c006d9 21-Dec-2010 Jung-uk Kim <jkim@FreeBSD.org>

Improve PCB flags handling and make it more robust. Add two new functions
for manipulating pcb_flags. These inline functions are very similar to
atomic_set_char(9) and atomic_clear_char(9) but without unnecessary LOCK
prefix for SMP. Add comments about the rationale[1]. Use these functions
wherever possible. Although there are some places where it is not strictly
necessary (e.g., a PCB is copied to create a new PCB), it is done across
the board for sake of consistency. Turn pcb_full_iret into a PCB flag as
it is safe now. Move rarely used fields before pcb_flags and reduce size
of pcb_flags to one byte. Fix some style(9) nits in pcb.h while I am in
the neighborhood.

Reviewed by: kib
Submitted by: kib[1]
MFC after: 2 months


# 0f0170e6 06-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Retire write-only PCB_FULLCTX pcb flag on amd64.

Reminded by: Petr Salinger <Petr.Salinger seznam cz>
Tested by: pho
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# cba32694 28-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Register an interrupt vector for DTrace return probes. There is some
code missing in lapic to make sure that we don't overwrite this entry,
but this will be done on a sequent commit.

Sponsored by: The FreeBSD Foundation


# 595473a5 23-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Clear DF bit in eflags/rflags on the kernel entry. The i386 and amd64
ABI specifies the DF should be zero, and newer compilers do not clear
DF before using DF-sensitive instructions.

The DF clearing for signal handlers was done some time ago.

MFC after: 1 week


# 3c5636cd 19-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207958:
Route all returns from the interrupts and faults through the doreti_iret
labeled iretq instruction.

MFC r208026:
Do not use .extern.


# 6a440fc3 12-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Route all returns from the interrupts and faults through the doreti_iret
labeled iretq instruction.

Suppose that multithreaded process executes two threads, currently
scheduled on different processors. Let assume that thread A executes
using %cs or %ss pointing into the descriptor from LDT. If IPI comes
which handler does not return by jump to doreti, and meantime thread B
invalidates descriptor pointed to by %cs or %ss, then iretq from IPI
handler could fault.

Routing the return by doreti_iret allows kernel to catch the situation
and recover from it by sending signal to the usermode.

Tested by: pho
MFC after: 1 week


# d4b0face 05-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207570:
Style and comment adjustements.


# bf6f1b56 03-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Style and comment adjustements.

Suggested and reviewed by: bde
MFC after: 3 days


# b8fde9ef 17-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206623:
ld_gs_base is executing with stack containing only the frame,
temporary pushed %rflags has been popped already.


# b71e04d3 14-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

ld_gs_base is executing with stack containing only the frame,
temporary pushed %rflags has been popped already.

Pointy hat to: kib
MFC after: 3 days


# a0e70f39 13-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206459:
Handle a case when non-canonical address is loaded into the fsbase or
gsbase MSR.


# a35d07a8 10-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Handle a case when non-canonical address is loaded into the fsbase or
gsbase MSR.

MFC after: 3 days


# 4ccf64eb 06-Apr-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

MFC r205014,205015:

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

This MFC is required for MFCs of later changes to the freebsd32
compatibility from HEAD.

Requested by: kib


# 841c0c7e 11-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 32580301 25-Feb-2010 Attilio Rao <attilio@FreeBSD.org>

Introduce the new kernel sub-tree x86 which should contain all the code
shared and generalized between our current amd64, i386 and pc98.

This is just an initial step that should lead to a more complete effort.
For the moment, a very simple porting of cpufreq modules, BIOS calls and
the whole MD specific ISA bus part is added to the sub-tree but ideally
a lot of code might be added and more shared support should grow.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib, jhb, imp
Discussed on: arch
MFC: 3 weeks


# d77e2734 10-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

When amd64 CPU cannot load segment descriptor during trap return to
usermode, it generates GPF, that is mirrored to user mode as SIGSEGV.
The offending register in mcontext should contain the value loading of
which generated the GPF, and it is so on i386. On amd64, we currently
report segment descriptor in tf_err, while segment register contains the
corrected value loaded by trap handler.

Fix the issue by behaving like i386, reloading segment register in trap
frame after signal frame is pushed onto user stack.

Noted and tested by: pho
Approved by: re (kensmith)


# a2622e5d 09-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

Restore the segment registers and segment base MSRs for amd64 syscall
return path only when neither thread was context switched while
executing syscall code nor syscall explicitely modified LDT or MSRs.

Save segment registers in trap handlers before interrupts are enabled,
to not allow context switches to happen before registers are saved.
Use separated byte in pcb for indication of fast/full return, since
pcb_flags are not synchronized with context switches.

The change puts back syscall microbenchmark numbers that were slowed
down after commit of the support for LDT on amd64.

Reviewed by: jeff
Tested (and tested, and tested ...) by: pho
Approved by: re (kensmith)


# 2c66ccca 01-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Save and restore segment registers on amd64 when entering and leaving
the kernel on amd64. Fill and read segment registers for mcontext and
signals. Handle traps caused by restoration of the
invalidated selectors.

Implement user-mode creation and manipulation of the process-specific
LDT descriptors for amd64, see sysarch(2).

Implement support for TSS i/o port access permission bitmap for amd64.

Context-switch LDT and TSS. Do not save and restore segment registers on
the context switch, that is handled by kernel enter/leave trampolines
now. Remove segment restore code from the signal trampolines for
freebsd/amd64, freebsd/ia32 and linux/i386 for the same reason.

Implement amd64-specific compat shims for sysarch.

Linuxolator (temporary ?) switched to use gsbase for thread_area pointer.

TODO:
Currently, gdb is not adapted to show segment registers from struct reg.
Also, no machine-depended ptrace command is added to set segment
registers for debugged process.

In collaboration with: pho
Discussed with: peter
Reviewed by: jhb
Linuxolator tested by: dchagin


# bb471e33 03-Feb-2009 Joseph Koshy <jkoshy@FreeBSD.org>

Improve robustness of NMI handling, for NMIs recognized in kernel
mode.

- Make the NMI handler run on its own stack (TSS_IST2).
- Store the GSBASE value for each CPU just before the start of
each NMI stack, permitting efficient retrieval using %rsp-relative
addressing.
- For NMIs taken from kernel mode, program MSR_GSBASE explicitly
since one or both of MSR_GSBASE and MSR_KGSBASE can be potentially
invalid. The current contents of MSR_GSBASE are saved and restored
at exit.
- For NMIs handled from user mode, continue to use 'swapgs' to
load the per-CPU GSBASE.

Reviewed by: jeff
Debugging help: jeff
Tested by: gnn, Artem Belevich <artemb at gmail dot com>


# a353a345 14-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

Disable interrupts, if they were enabled, before doing swapgs.
Otherwise, interrupt may happen while we run with kernel CS and usermode
gsbase.

Reviewed by: jeff
MFC after: 1 week


# 4e706fe3 14-Dec-2008 Joseph Koshy <jkoshy@FreeBSD.org>

Bug fix: %ebx needs to be preserved in the user callchain capture
path.


# 6fe00c78 13-Dec-2008 Joseph Koshy <jkoshy@FreeBSD.org>

- Bug fix: prevent a thread from migrating between CPUs between the
time it is marked for user space callchain capture in the NMI
handler and the time the callchain capture callback runs.

- Improve code and control flow clarity by invoking hwpmc(4)'s user
space callchain capture callback directly from low-level code.

Reviewed by: jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
Fabien Thomas <fabien dot thomas at netasq dot com>,
Artem Belevich <artemb at gmail dot com>


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 8ad85ff2 18-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

The doreti_iret_fault code is always called with gs base MSR containing
kernel gs base, because %rip is adjusted only on kernel-mode trap caused
by iretq execution. On the other hand, the stack contains (hardware
part of) trap frame from the usermode. As a consequence, checking for
frame mode and doing swapgs causes the kernel to enter trap() with
usermode gs base.

Remove the check for mode and conditional swapgs, we already have right
gs base in the MSR.

Submitted by: Nate Eldredge <neldredge math ucsd edu>
MFC after: 3 days


# 367f3ce5 24-May-2008 John Birrell <jb@FreeBSD.org>

Add the DTrace hooks for exception handling (Function boundary trace
-fbt- provider), cyclic clock and syscalls.


# d07f36b0 07-Dec-2007 Joseph Koshy <jkoshy@FreeBSD.org>

Kernel and hwpmc(4) support for callchain capture.

Sponsored by: FreeBSD Foundation and Google Inc.


# 185250da 15-Nov-2007 John Baldwin <jhb@FreeBSD.org>

Add support for cross double fault frames in stack traces:
- Populate the register values for the trapframe put on the stack by the
double fault handler.
- Teach DDB's trace routine to treat a double fault like other trap frames.

MFC after: 3 days


# 4b0f4e9d 22-Dec-2006 David Xu <davidxu@FreeBSD.org>

Fix a panic when rebooting a SMP machine, when option STOP_NMI is used,
nmi handler is used to stop other processors, nmi hander calls trap(),
however, trap() now accepts a pointer rather than a reference, this was
changed by kmacy@.


# e5f8d409 16-Dec-2006 Kip Macy <kmacy@FreeBSD.org>

Newer versions of gcc don't support treating structures passed by value
as if they were really passed by reference. Specifically, the dead stores
elimination pass in the GCC 4.1 optimiser breaks the non-compliant behavior
on which FreeBSD relied. This change brings FreeBSD up to date by switching
trap frames to being explicitly passed by reference.

Reviewed by: kan
Tested by: kan


# 7ef5ed2b 27-Aug-2005 Joseph Koshy <jkoshy@FreeBSD.org>

- Special-case NMI handling on the AMD64.

On entry or exit from the kernel the 'alltraps' and 'doreti' code
used taken by normal traps disables interrupts to protect the
critical sections where it is setting up %gs.

This protection is insufficient in the presence of NMIs since NMIs
can be taken even when the processor has disabled normal interrupts.
Thus the NMI handler needs to actually read MSR_GBASE on entry to
the kernel to determine whether a swap of %gs using 'swapgs' is
needed. However, reads of MSRs are expensive and integrating this
check into the 'alltraps'/'doreti' path would penalize normal
interrupts.

- Teach DDB about the 'nmi_calltrap' symbol.

Reviewed by: bde, peter (older versions of this change)


# 0e7bd54c 25-Aug-2005 Stephan Uphoff <ups@FreeBSD.org>

NMI handler should not enable interrupts.

Tested by: kris@
MFC after: 3 weeks


# 2bdfd0d8 29-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Add a special-case handler for general protection faults. It appears to
be possible to get the swapgs state reversed if doreti traps during
the iretq. Attempt to handle this. load_gs() might need special
handling too. Running the kernel with the user's TLS and the
kernel's PCPU space interchanged would be bad(TM).

Discovered as a result of a conversation with: bde
Approved by: re


# e973a7af 23-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Eliminate a source of 'trap xx with interrupts disabled'. I was jumping to
the wrong backend code and neglecting to re-enable interrupts after the
stack prep.

Approved by: re


# ea5d7683 22-May-2005 Peter Wemm <peter@FreeBSD.org>

Fix some of the problems Bruce observed with this code.


# c592ba5f 20-May-2005 Peter Wemm <peter@FreeBSD.org>

For non-profiling kernels, there were two symbols assigned to the same
address. One was alltraps_with_regs_pushed, the other was calltrap.

When the stack tracer walks up, it looks for magic symbol names to
determine how to parse non-standard stack frames, such as a trapframe.
It was looking for "calltrap". Which of the two symbols you got depended
on things like Phase of moon, etc. If you were unlucky, you got a
garbage stack trace for things like 'debug.trace_on_panic', which would
completely hide the actual source of the problem.


# ba2426ff 20-Jan-2005 Peter Wemm <peter@FreeBSD.org>

MFi386: whitespace, copyright header, etc updates


# 7a071c6a 15-Aug-2004 David E. O'Brien <obrien@FreeBSD.org>

Complete 'IA32' -> 'COMPAT_IA32' change for the Linuxulator32.


# d7d197e0 23-May-2004 Bruce Evans <bde@FreeBSD.org>

Oops, ".align 4" for the data section in the previous commit should
have been ".p2align 4". This bug is cosmetic since the data section
happens to be empty.


# 003d5d66 23-May-2004 Bruce Evans <bde@FreeBSD.org>

Fixed profiling of trap, syscall and interrupt handlers and some
ordinary functions, essentially by backing out half of rev.1.115 of
amd64/exception.S. The handlers must be between certain labels for
the purposes of profiling, and this was broken by scattering them in
separately compiled .S files, especially for ordinary functions that
ended up between the labels. Merge the files by #including them as
before, except with different pathnames and better comments and
organization. Changes to the scattered files are minimal -- just
move the labels to the file that does the #includes.

This also partly fixes profiling of IPIs -- all IPI handlers are now
correctly classified as interrupt handlers, but many are still missing
mcount calls.


# 03d5ca33 23-May-2004 Bruce Evans <bde@FreeBSD.org>

Restored FAKE_MCOUNT() and MEXITCOUNT invocations and adjusted them for
amd64 as necessary. This is routine, except:
- the FAKE_MCOUNT($bintr) in doreti was missing the '$'. This gave a
a garbage address made up of padding bytes (with the nop byte 0x90 as
the MSB) instead of the intended address of bintr. This accidentally
worked on i386's because (0x90 << 24) is close enough to bintr, but
it doesn't work on amd64's because (0x90 << 56) is much further away
from bintr.
- the FAKE_MCOUNT($btrap) in calltrap was similarly broken. It hasn't
been needed since FreeBSD-1, so just delete it.


# 29ae923f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 0d2a2989 17-Nov-2003 Peter Wemm <peter@FreeBSD.org>

Initial landing of SMP support for FreeBSD/amd64.

- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
(in particular, bootstrap)
- This is still a WIP. It seems that there are some highly bogus bioses
on nVidia nForce3-150 boards. I can't stress how broken these boards
are. I have a workaround in mind, but right now the Asus SK8N is broken.
The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE. SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
atpic' in addition, because they somehow managed to forget to connect the
8254 timer to the apic, even though its in the same silicon! ARGH!
This directly violates the ACPI spec.


# e46fe241 12-Nov-2003 Peter Wemm <peter@FreeBSD.org>

Stop pretending to support kernel profiling. The FAKE_MCOUNT() etc
calls are just gradually getting more and more stale. At this point it
would be better to start from scratch once prof_machdep.c is adapted.


# 19acc770 14-Oct-2003 Peter Wemm <peter@FreeBSD.org>

Pull the tier-2 card one last time and break the get/setcontext and
sigreturn() ABI and the signal context on the stack.

Make the trapframe (and its shadows in the ucontext and sigframe etc)
8 bytes larger in order to preserve 16 byte stack alignment for the
following C code calls. I could have done some padding after the
trapframe was saved, but some of the C code still expects an argument of
'struct trapframe'. Anyway, this gives me a spare field that can be used
to store things like 'partial trapframe' status or something else in
the future.

The runtime impact is fairly small, *except* for threaded apps and things
that decode contexts and the signal stack (eg: cvsup binary). Signal
delivery isn't too badly affected because the kernel generates the
sigframe that sigreturn uses after the handler has been called.

The size of mcontext_t and struct sigframe hasn't changed. Only
the last few fields (sc_eip etc) got moved a little and I eliminated
a spare field. mc_len/sc_len did change location though so the
sanity checks there will still trap it.


# e63f19e1 22-Sep-2003 Peter Wemm <peter@FreeBSD.org>

MFi386 rev 1.105 by jhb: fix comment typo


# 2c25d124 09-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Clean up get/set_mcontext() and get/set_fpcontext(). These are operated
on data structures on the kernel stack which are guaranteed to be 16 byte
aligned by gcc, the amd64 ABI and __aligned(16).

Ensire the tss_rsp0 initial stack pointer is 16 byte aligned in case
sizeof(pcb) becomes odd at some point. This is convenient for the
interrupt handler case because the ring crossing pushes cause the
required odd alignment before the call to the C code.

Have fast_syscall add an additional 8 bytes to ensure that the trapframe
has the correct odd alignment for the call to C code. Note that there are
no checks to make sure that the trapframe size is appropriate for this.

This makes get/setfpcontext work properly (finally). You get a GPF in
kernel mode if any of this is botched without the alignment fixup code
that is apparently needed on i386.


# d85631c4 13-May-2003 Peter Wemm <peter@FreeBSD.org>

Add BASIC i386 binary support for the amd64 kernel. This is largely
stolen from the ia64/ia32 code (indeed there was a repocopy), but I've
redone the MD parts and added and fixed a few essential syscalls. It
is sufficient to run i386 binaries like /bin/ls, /usr/bin/id (dynamic)
and p4. The ia64 code has not implemented signal delivery, so I had
to do that.

Before you say it, yes, this does need to go in a common place. But
we're in a freeze at the moment and I didn't want to risk breaking ia64.
I will sort this out after the freeze so that the common code is in a
common place.

On the AMD64 side, this required adding segment selector context switch
support and some other support infrastructure. The %fs/%gs etc code
is hairy because loading %gs will clobber the kernel's current MSR_GSBASE
setting. The segment selectors are not used by the kernel, so they're only
changed at context switch time or when changing modes. This still needs
to be optimized.

Approved by: re (amd64/* blanket)


# 0fe93e74 12-May-2003 Peter Wemm <peter@FreeBSD.org>

For the page fault handler, save %cr2 in the outer trap handler so that
we do not have to run so long with interrupts disabled. This involved
creating tf_addr in the trapframe. Reorganize the trap stubs so that
they consistently reserve the stack space and initialize any missing
bits.

Approved by: re (amd64 stuff)


# bf1e8974 11-May-2003 Peter Wemm <peter@FreeBSD.org>

Give a %fs and %gs to userland. Use swapgs to obtain the kernel %GS.base
value on entry and exit. This isn't as easy as it sounds because when
we recursively trap or interrupt, we have to avoid duplicating the
swapgs instruction or we end up back with the userland %gs. I implemented
this by testing TF_CS to see if we're coming from supervisor mode
already, and check for returning to supervisor. To avoid a race with
interrupts in the brief period after beginning executing the handler and
before the swapgs, convert all trap gates to interrupt gates, and reenable
interrupts immediately after the swapgs. I am not happy with this.
There are other possible ways to do this that should be investigated.
(eg: storing the GS.base MSR value in the trapframe)

Add some sysarch functions to let the userland code get to this.

Approved by: re (blanket amd64/*)


# 4ce3e250 11-May-2003 Peter Wemm <peter@FreeBSD.org>

Since compiling natively, the compile environment has been less forgiving
about silly typos. Use the correct comment sequences.


# 9c43b77f 07-May-2003 Peter Wemm <peter@FreeBSD.org>

Fix a preemption race. I was reenabling interrupts in the fast system
call handler before it was safe. It was possible for to lose context
and for something else to clobber the PCPU scratch variable. This
moves the interrupt enable *way* too late, but its better safe than
sorry for the moment.


# 5b27e934 02-May-2003 Peter Wemm <peter@FreeBSD.org>

Repocopy *.s to *.S


# afa88623 30-Apr-2003 Peter Wemm <peter@FreeBSD.org>

Commit MD parts of a loosely functional AMD64 port. This is based on
a heavily stripped down FreeBSD/i386 (brutally stripped down actually) to
attempt to get a stable base to start from. There is a lot missing still.
Worth noting:
- The kernel runs at 1GB in order to cheat with the pmap code. pmap uses
a variation of the PAE code in order to avoid having to worry about 4
levels of page tables yet.
- It boots in 64 bit "long mode" with a tiny trampoline embedded in the
i386 loader. This simplifies locore.s greatly.
- There are still quite a few fragments of i386-specific code that have
not been translated yet, and some that I cheated and wrote dumb C
versions of (bcopy etc).
- It has both int 0x80 for syscalls (but using registers for argument
passing, as is native on the amd64 ABI), and the 'syscall' instruction
for syscalls. int 0x80 preserves all registers, 'syscall' does not.
- I have tried to minimize looking at the NetBSD code, except in a couple
of places (eg: to find which register they use to replace the trashed
%rcx register in the syscall instruction). As a result, there is not a
lot of similarity. I did look at NetBSD a few times while debugging to
get some ideas about what I might have done wrong in my first attempt.


# 4a338afd 17-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 6f8132a8 31-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# aff81a81 28-Jan-2003 Jake Burkholder <jake@FreeBSD.org>

Remove BDE_DEBUGGER.

Discussed with: bde


# 0dbb100b 26-Jan-2003 David Xu <davidxu@FreeBSD.org>

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 8c132e9f 06-Nov-2002 David Xu <davidxu@FreeBSD.org>

1.Fix smp race between kernel vm86 BIOS calling and userland vm86 mode code,
remove global variable in_vm86call, set vm86 calling flag in PCB flags.

2.Fix vm86 BIOS calling preempted problem by changing vm86_lock mutex type
from MTX_DEF to MTX_SPIN. vm86pcb is not remembered in thread struct,
when the thread calling vm86 BIOS is preempted by interrupt thread,
and later switching back to the thread would cause incorrect context be
loaded into CPU registers, this leads to kernel crash.


# 23efa1f8 27-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Unwind the syscall_with_err_pushed tweak that jake did some time back.

OK'ed by: jake


# 83587634 10-Jul-2002 Julian Elischer <julian@FreeBSD.org>

fix a comment and note a problem with XXXSMP


# 50b6a555 10-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Remove the critmode sysctl - the new method for critical_enter/exit (already
the default) is now the only method for i386.

Remove the paraphanalia that supported critmode. Remove td_critnest, clean
up the assembly, and clean up (mostly remove) the old junk from
cpu_critical_enter() and cpu_critical_exit().


# f55348f0 09-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Include all of isa/ipl.s into exception.s as there is now nothing left in
ipl.s except doreti which really belongs in with the exceptions as it's
just the other side of the same coin. Will remove ipl.s in a separate commit.

Agreed by: several including bde@freebsd.org


# d74ac681 26-Mar-2002 Matthew Dillon <dillon@FreeBSD.org>

Compromise for critical*()/cpu_critical*() recommit. Cleanup the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# 181df8c9 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

revert last commit temporarily due to whining on the lists.


# f96ad4c2 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

STAGE-1 of 3 commit - allow (but do not require) interrupts to remain
enabled in critical sections and streamline critical_enter() and
critical_exit().

This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not effected by this change.

This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.

This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.

The following changes have been made:

* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.

Other architectures are unaffected.

* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather then hardcoded
in MI code.

The new code places the burden of entering the critical section
in the trampoline code where it belongs.

* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.

* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.

* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.

* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.

* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.

* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables
it is not currently necessary to lock; bus cycles modifying them.

Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.

* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.

Performance issues:

One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.

The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather then defer.

The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).

The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().

Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.


# d2f22d70 10-Feb-2002 Bruce Evans <bde@FreeBSD.org>

Garbage-collect the "LOCORE" version of MPLOCKED.


# a0f831f5 24-Aug-2001 John Baldwin <jhb@FreeBSD.org>

Remove references to the old giant kernel lock in various comments.


# d072acf9 12-Aug-2001 Bruce Evans <bde@FreeBSD.org>

Removed he BPTTRAP() macro and its use. It was intended for restoring
bug for bug compatibility to ddb trap handlers after fixing the debugger
trap gates to be interrupt gates, but the fix was never committed. Now
I want the fix to apply to ddb.


# 9d146ac5 12-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Activate SSE/SIMD. This is the extra context switching support that
we are required to do if we let user processes use the extra 128 bit
registers etc.

This is the base part of the diff I got from:
http://www.issei.org/issei/FreeBSD/sse.html
I believe this is by: Mr. SUZUKI Issei <issei@issei.org>
SMP support apparently by: Takekazu KATO <kato@chino.it.okayama-u.ac.jp>
Test code by: NAKAMURA Kazushi <kaz@kobe1995.net>, see
http://kobe1995.net/~kaz/FreeBSD/SSE.en.html

I have fixed a couple of style(9) deviations. I have some followup
commits to fix a couple of non-style things.


# 1c1771cb 22-May-2001 Bruce Evans <bde@FreeBSD.org>

Convert npx interrupts into traps instead of vice versa. This is much
simpler for npx exceptions that start as traps (no assembly required...)
and works better for npx exceptions that start as interrupts (there is
no longer a problem for nested interrupts).

Submitted by: original (pre-SMPng) version by luoqi


# 8bd57f8f 15-May-2001 John Baldwin <jhb@FreeBSD.org>

Remove unneeded includes of sys/ipl.h and machine/ipl.h.


# 02318dac 24-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Remove the leading underscore from all symbols defined in x86 asm
and used in C or vice versa. The elf compiler uses the same names
for both. Remove asnames.h with great prejudice; it has served its
purpose.

Note that this does not affect the ability to generate an aout kernel
due to gcc's -mno-underscores option.

moral support from: peter, jhb


# 631d7bf3 24-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

- Rename the lcall system call handler from Xsyscall to Xlcall_syscall
to be more like Xint0x80_syscall and less like c function syscall().
- Reduce code duplication between the int0x80 and lcall handlers by
shuffling the elfags into the right place, saving the sizeof the
instruction in tf_err and jumping into the common int0x80 code.

Reviewed by: peter


# d888fc4e 11-Feb-2001 Mark Murray <markm@FreeBSD.org>

RIP <machine/lock.h>.

Some things needed bits of <i386/include/lock.h> - cy.c now has its
own (only) copy of the COM_(UN)LOCK() macros, and IMASK_(UN)LOCK()
has been moved to <i386/include/apic.h> (AKA <machine/apic.h>).
Reviewed by: jhb


# 142ba5f3 09-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Make astpending and need_resched process attributes rather than CPU
attributes. This is needed for AST's to be properly posted in a preemptive
kernel. They are backed by two new flags in p_sflag: PS_ASTPENDING and
PS_NEEDRESCHED. They are still accesssed by their old macros:
aston(), astoff(), etc. For completeness, an astpending() macro has been
added to check for a pending AST, and clear_resched() has been added to
clear need_resched().
- Rename syscall2() on the x86 back to syscall() to be consistent with
other architectures.


# 2a36ec35 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Change fork_exit() to take a pointer to a trapframe as its 3rd argument
instead of a trapframe directly. (Requested by bde.)
- Convert the alpha switch_trampoline to call fork_exit() and use the MI
fork_return() instead of child_return().
- Axe child_return().


# 89cb8fc9 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Call fork_exit() now instead of futzing around in assembly during a fork
return.


# a448b62a 21-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Make intr_nesting_level per-process, rather than per-cpu. Setup
interrupt threads to run with it always >= 1, so that malloc can
detect M_WAITOK from "interrupt" context. This is also necessary
in order to context switch from sched_ithd() directly.

Reviewed By: peter


# 87dce368 19-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Simplify the i386 asm MTX_{ENTER,EXIT} macros to just call the
appropriate function, rather than doing a horse-and-buggy
acquire. They now take the mutex type as an arg and can be
used with sleep as well as spin mutexes.


# c1ef8aac 19-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

- Make npx_intr INTR_MPSAFE and move acquiring Giant into the
function itself.
- Remove a hack to allow acquiring Giant from the npx asm trap
vector.


# 558226ea 19-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Use #ifdef DEV_NPX from opt_npx.h instead of #if NNPX > 0 from npx.h


# 41ed17bf 06-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Use %fs to access per-cpu variables in uni-processor kernels the same
as multi-processor kernels. The old way made it difficult for kernel
modules to be portable between uni-processor and multi-processor
kernels. It is no longer necessary to jump through hoops.

- always load %fs with the private segment on entry to the kernel
- change the type of the self referntial pointer from struct privatespace
to struct globaldata
- make the globaldata symbol have value 0 in all cases, so the symbols
in globals.s are always offsets, not aliases for fields in globaldata
- define the globaldata space used for uniprocessor kernels in C, rather
than assembler
- change the assmebly language accessors to use %fs, add a macro
PCPU_ADDR(member, reg), which loads the register reg with the address
of the per-cpu variable member


# 6d43764a 13-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Introduce a new potientially cleaner interface for accessing per-cpu
variables from i386 assembly language. The syntax is PCPU(member)
where member is the capitalized name of the per-cpu variable, without
the gd_ prefix. Example: movl %eax,PCPU(CURPROC). The capitalization
is due to using the offsets generated by genassym rather than the symbols
provided by linking with globals.o. asmacros.h is the wrong place for
this but it seemed as good a place as any for now. The old implementation
in asnames.h has not been removed because it is still used to de-mangle
the symbols used by the C variables for the UP case.


# 21ad98bc 30-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Change doreti to take a trapframe instead of an intrframe.
Remove associated pushes of dummy units to convert frame.

Reviewed by: jhb


# 4ae465d0 14-Nov-2000 John Baldwin <jhb@FreeBSD.org>

Always enable interrupts during fork_trampoline() after releasing the
sched_lock. This is needed for kernel threads that are created before
interrupts are enabled. kthreads created by kld's that are created at
SI_SUB_KLD such as the random kthread.

Tested by: phk


# 03cea59c 05-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Remove an unnecessary sti and spl0() in fork_trampoline. Interrupts
should be enabled by MTX_EXIT() now when it releases the sched_lock.


# 1931cf94 05-Oct-2000 John Baldwin <jhb@FreeBSD.org>

- Heavyweight interrupt threads on the alpha for device I/O interrupts.
- Make softinterrupts (SWI's) almost completely MI, and divorce them
completely from the x86 hardware interrupt code.
- The ihandlers array is now gone. Instead, there is a MI shandlers array
that just contains SWI handlers.
- Most of the former machine/ipl.h files have moved to a new sys/ipl.h.
- Stub out all the spl*() functions on all architectures.

Submitted by: dfr


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# fc81cf82 09-May-2000 David E. O'Brien <obrien@FreeBSD.org>

1. `movl' is for use with 32-bit operands. Do NOT use it with 16-bit
operands. `movw' could be used, but instead let the assembler decide
the right instruction to use.
2. AT&T asm syntax requires a leading '*' in front of the operand for
indirect calls and jumps.


# bd5caafc 28-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

The SMP cleanup commit broke need_resched, this fixes that and also
removed unncessary MPLOCKED and 'lock' prefixes from the interrupt
nesting level, since (A) the MP lock is held at the time, and (B) since
the neting level is restored prior to return any interrupted code
will see a consistent value.


# 36e9f877 28-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various deadended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# b812525f 25-Nov-1999 Julian Elischer <julian@FreeBSD.org>

Fix out-of-date comment


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# eec2e836 10-Jul-1999 Bruce Evans <bde@FreeBSD.org>

Go back to the old (icu.s rev.1.7 1993) way of keeping the AST-pending
bit separate from ipending, since this is simpler and/or necessary for
SMP and may even be better for UP.

Reviewed by: alc, luoqi, tegge


# fd56d8b7 27-Jun-1999 Alan Cox <alc@FreeBSD.org>

An SMP-specific change: Remove an unnecessary lock acquire and release
from every system call. (Storing a 32-bit constant is inherently
atomic.)

Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>


# eb9d435a 01-Jun-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Unifdef VM86.

Reviewed by: silence on on -current


# ea2b3e3d 06-May-1999 Bruce Evans <bde@FreeBSD.org>

Fixed profiling of elf kernels. Made high resolution profiling compile
for elf kernels (it is broken for all kernels due to lack of egcs support).

Renaming of many assembler labels is avoided by declaring by declaring
the labels that need to be visible to gprof as having type "function"
and depending on the elf version of gprof being zealous about discarding
the others. A few type declarations are still missing, mainly for SMP.

PR: 9413
Submitted by: Assar Westerlund <assar@sics.se> (initial parts)


# 5206bca1 27-Apr-1999 Luoqi Chen <luoqi@FreeBSD.org>

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 6182fdbd 16-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Bring the 'new-bus' to the i386. This extensively changes the way the
i386 platform boots, it is no longer ISA-centric, and is fully dynamic.
Most old drivers compile and run without modification via 'compatability
shims' to enable a smoother transition. eisa, isapnp and pccard* are
not yet using the new resource manager. Once fully converted, all drivers
will be loadable, including PCI and ISA.

(Some other changes appear to have snuck in, including a port of Soren's
ATA driver to the Alpha. Soren, back this out if you need to.)

This is a checkpoint of work-in-progress, but is quite functional.

The bulk of the work was done over the last few years by Doug Rabson and
Garrett Wollman.

Approved by: core


# e7ba67f2 28-Feb-1999 Bruce Evans <bde@FreeBSD.org>

Removed all traces of `p_switchtime'. The relevant timestamp is per-cpu,
not per-process. Keep it in `switchtime' consistently.

It is now clear that the timestamp is always valid in fork_trampoline()
except when the child is running on a previously idle cpu, which
can only happen if there are multiple cpus, so don't check or set
the timestamp in fork_trampoline except in the (i386) SMP case.
Just remove the alpha code for setting it unconditionally, since
there is no SMP case for alpha and the code had rotted.

Parts reviewed by: dfr, phk


# 1b0b259e 25-Feb-1999 Bruce Evans <bde@FreeBSD.org>

Don't forget to update `switchticks' in corner cases (except for
the alpha fork_trampoline(), forget it because it I believe it is
only necessary for the unsupported SMP case).


# 0c61bd3a 10-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Fixed restoring of cpl after trap handling. The wrong cpl (SWI_AST_MASK
instead of 0) was "restored" after handling a trap that occurred while
returning to user mode. This bug was most noticeable for VM86 and is
still detected and fixed up (on return from the next exception) in doreti
if VM86 is configured.


# e539380e 28-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Set p->p_switchtime to switchtime instead of to the current time in
fork_trampoline() if switchtime is valid. This fixes not accounting
for the time between the previous context switch and and the current
time (when the forked child starts up here) in most cases - the time
is now counted in the child's runtime. I think it actually fixes
all cases, and switchtime is always valid here, since there must have
been a context switch just before the forked child starts up. Some
code should be removed if this is correct. The check that switchtime
is valid sometimes gives a false negative because the check isn't
correct until the after the first context switch after the system
has been up for >= 1 second.


# e796e00d 28-May-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Some cleanups related to timecounters and weird ifdefs in <sys/time.h>.

Clean up (or if antipodic: down) some of the msgbuf stuff.

Use an inline function rather than a macro for timecounter delta.

Maintain process "on-cpu" time as 64 bits of microseconds to avoid
needless second rollover overhead.

Avoid calling microuptime the second time in mi_switch() if we do
not pass through _idle in cpu_switch()

This should reduce our context-switch overhead a bit, in particular
on pre-P5 and SMP systems.

WARNING: Programs which muck about with struct proc in userland
will have to be fixed.

Reviewed, but found imperfect by: bde


# c21410e1 17-May-1998 Poul-Henning Kamp <phk@FreeBSD.org>

s/nanoruntime/nanouptime/g
s/microruntime/microuptime/g

Reviewed by: bde


# dc733423 17-Apr-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# dcfe0058 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Fixed breakage of fork accounting in previous commit. A fork benchmark
reported about 15 times as much sys time as real time. getmicroruntime()
is confusing name.


# 00af9731 04-Apr-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Time changes mark 2:

* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.

* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.

* kill mono_time.

* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)

* nanosleep, select & poll takes long sleeps one day at a time

Reviewed by: bde
Tested by: ache and others


# 640c4313 23-Mar-1998 Jonathan Lemon <jlemon@FreeBSD.org>

Add the ability to make real-mode BIOS calls from the kernel. Currently,
everything is contained inside #ifdef VM86, so this option must be
present in the config file to use this functionality.

Thanks to Tor Egge, these changes should work on SMP machines. However,
it may not be throughly SMP-safe.

Currently, the only BIOS calls made are memory-sizing routines at bootup,
these replace reading the RTC values.


# b3c0d232 27-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Oops, <machine/psl.h> is used unconditionally in -current.


# ba21341d 27-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Cleaned up #includes.

Ifdefed conditionally used includes.

Finished changing indentation of per-statement comments to 40.


# 98823b23 10-Oct-1997 Peter Wemm <peter@FreeBSD.org>

Convert the VM86 option from a global option to an option only depended
on by the files that use it. Changing the VM86 option now only causes
a recompile of a dozen files or so rather than the entire kernel.


# 20233f27 07-Sep-1997 Steve Passe <fsmp@FreeBSD.org>

General cleanup of the lock pushdown code. They are grouped and enabled
from machine/smptests.h:

#define PUSHDOWN_LEVEL_1
#define PUSHDOWN_LEVEL_2
#define PUSHDOWN_LEVEL_3
#define PUSHDOWN_LEVEL_4_NOT


# 78292efe 30-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Another round of lock pushdown.
Add a simplelock to deal with disable_intr()/enable_intr() as used in UP kernel.
UP kernel expects that this is enough to guarantee exclusive access to
regions of code bracketed by these 2 functions.
Add a simplelock to bracket clock accesses in clock.c: clock_lock.

Help from: Bruce Evans <bde@zeta.org.au>


# 5f642c16 29-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Support for the new FAST_HI algorithm.
Improved interrupt handling, fewer silo overflows.

With help from: dave adkins <adkin003@gold.tc.umn.edu>


# 886e7896 23-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

The last of the encapsolation of cpl/spl/ipending things into a critical
region protected by the simplelock 'cpl_lock'.

Notes:

- this code is currently controlled on a section by section basis with
defines in machine/param.h. All sections are currently enabled.

- this code is not as clean as I would like, but that can wait till later.

- the "giant lock" still surrounds most instances of this "cpl region".
I still have to do the code that arbitrates setting cpl between the
top and bottom halves of the kernel.

- the possibility of deadlock exists, I am committing the code at this
point so as to exercise it and detect any such cases B4 the "giant lock"
is removed.


# 4a73d99f 20-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Made PEND_INTS default.
Made NEW_STRATEGY default.
Removed misc. old cruft.

Centralized simple locks into mp_machdep.c
Centralized simple lock macros into param.h

More cleanup in the direction of making splxx()/cpl MP-safe.


# 7b185ef8 19-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Preperation for moving cpl into critical region access.
Several new fine-grained locks.
New FAST_INTR() methods:
- separate simplelock for FAST_INTR, no more giant lock.
- FAST_INTR()s no longer checks ipending on way out of ISR.
sio made MP-safe (I hope).


# a5e8237d 10-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Oops, fix breakage to UP kernel.


# e7802310 10-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Added trap specific lock calls: get_fpu_lock, etc.
All resolve to the GIANT_LOCK at this time, it is purely a logical partitioning.


# 8a5da002 09-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Minor conditionalization of XXX_MPLOCK on PEND_INTS.


# 48a09cf2 08-Aug-1997 John Dyson <dyson@FreeBSD.org>

VM86 kernel support.
Work done by BSDI, Jonathan Lemon <jlemon@americantv.com>,
Mike Smith <msmith@gsoft.com.au>, Sean Eric Fagan <sef@kithrup.com>,
and probably alot of others.
Submitted by: Jnathan Lemon <jlemon@americantv.com>


# da9f0182 30-Jul-1997 Steve Passe <fsmp@FreeBSD.org>

Converted the TEST_LOPRIO code to default.
Created mplock functions that save/restore NO registers.
Minor cleanup.


# 812e4da7 23-Jul-1997 Steve Passe <fsmp@FreeBSD.org>

New simple_lock code in asm:
- s_lock_init()
- s_lock()
- s_lock_try()
- s_unlock()

Created lock for IO APIC and apic_imen (SMP version of imen)
- imen_lock

Code to use imen_lock for access from apic_ipl.s and apic_vector.s.
Moved this code *outside* of mp_lock.

It seems to work!!!


# e31521c3 20-Jul-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# dc514523 30-Jun-1997 Bruce Evans <bde@FreeBSD.org>

Un-inline a call to spl0(). It is not time critical, and was only inline
because there was no non-inline spl0() to call.


# b3196e4b 22-Jun-1997 Peter Wemm <peter@FreeBSD.org>

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 5400ed3b 31-May-1997 Peter Wemm <peter@FreeBSD.org>

Include file updates.. <machine/spl.h> -> <machine/ipl.h>, add
<machine/ipl.h> to those files that were depending on getting SWI_*
implicitly via <machine/cpufunc.h>


# 288e2230 28-May-1997 Peter Wemm <peter@FreeBSD.org>

remove no longer needed opt_smp.h includes


# 694f6724 26-May-1997 Steve Passe <fsmp@FreeBSD.org>

Changed inclusion of isa/icu.s to isa/ipl.s.
This is part of the breakup of UP/SMP specific INTerrupt code.


# 8359d00f 07-May-1997 Peter Wemm <peter@FreeBSD.org>

forgotten comment


# 477a642c 26-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


# d12ee02d 13-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Don't forget to set `runtime' in fork_trampoline(). The time slice before
switching to a child for the first time was being counted twice. I think
this only affected unimportant statistics.

Simplified arg handling in fork_trampoline(). splz() doesn't actually
smash the registers of interest.


# e2d87711 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Lower the spl() of the new process from splhigh() right away, since
nothing else will lower it until either much later, or never(?) for
kernel processes.

This basically re-fixes what Bruce fixed in rev 1.29 of kern_fork.c,
which was broken again now the child does not execute back up the fork()
calling tree.


# a2a1c95c 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplity the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measuably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 11282a57 11-Aug-1996 David Greenman <dg@FreeBSD.org>

Add support for i686 machine check trap.


# ee323f62 30-May-1996 Peter Wemm <peter@FreeBSD.org>

Jump some hoops to have the *.s code being able to be run through both an
ansi and traditional cpp.

The nesting rules of macros are different, which required some changes.
Use __CONCAT(x,y) instead of /**/.
Redo some comments to use /* */ rather than "# comment" because the ansi
cpp cares about those, and also cares about quote matching.


# a8c5fef5 02-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

KGDB is dead. It may come back one day if somebody does it.


# 22a31706 11-Apr-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Make alltraps a .globl so that DDB doesn't make people belive they have
an ALIGNFLT on their hands all the time.


# d66a5066 02-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Mega-commit for Linux emulator update.. This has been stress tested under
netscape-2.0 for Linux running all the Java stuff. The scrollbars are now
working, at least on my machine. (whew! :-)

I'm uncomfortable with the size of this commit, but it's too
inter-dependant to easily seperate out.

The main changes:

COMPAT_LINUX is *GONE*. Most of the code has been moved out of the i386
machine dependent section into the linux emulator itself. The int 0x80
syscall code was almost identical to the lcall 7,0 code and a minor tweak
allows them to both be used with the same C code. All kernels can now
just modload the lkm and it'll DTRT without having to rebuild the kernel
first. Like IBCS2, you can statically compile it in with "options LINUX".

A pile of new syscalls implemented, including getdents(), llseek(),
readv(), writev(), msync(), personality(). The Linux-ELF libraries want
to use some of these.

linux_select() now obeys Linux semantics, ie: returns the time remaining
of the timeout value rather than leaving it the original value.

Quite a few bugs removed, including incorrect arguments being used in
syscalls.. eg: mixups between passing the sigset as an int, vs passing
it as a pointer and doing a copyin(), missing return values, unhandled
cases, SIOC* ioctls, etc.

The build for the code has changed. i386/conf/files now knows how
to build linux_genassym and generate linux_assym.h on the fly.

Supporting changes elsewhere in the kernel:

The user-mode signal trampoline has moved from the U area to immediately
below the top of the stack (below PS_STRINGS). This allows the different
binary emulations to have their own signal trampoline code (which gets rid
of the hardwired syscall 103 (sigreturn on BSD, syslog on Linux)) and so
that the emulator can provide the exact "struct sigcontext *" argument to
the program's signal handlers.

The sigstack's "ss_flags" now uses SS_DISABLE and SS_ONSTACK flags, which
have the same values as the re-used SA_DISABLE and SA_ONSTACK which are
intended for sigaction only. This enables the support of a SA_RESETHAND
flag to sigaction to implement the gross SYSV and Linux SA_ONESHOT signal
semantics where the signal handler is reset when it's triggered.

makesyscalls.sh no longer appends the struct sysentvec on the end of the
generated init_sysent.c code. It's a lot saner to have it in a seperate
file rather than trying to update the structure inside the awk script. :-)

At exec time, the dozen bytes or so of signal trampoline code are copied
to the top of the user's stack, rather than obtaining the trampoline code
the old way by getting a clone of the parent's user area. This allows
Linux and native binaries to freely exec each other without getting
trampolines mixed up.


# 32831552 21-Dec-1995 David Greenman <dg@FreeBSD.org>

Rewrote most of the ddb stack traceback code. These changes are smarter
about decoding trap/syscall/interrupt frames and generally works better
than the previous stuff.
Removed some special (incorrect) frobbing of the frame pointer that
was messing some things up with the new traceback code.


# 2838c968 19-Dec-1995 David Greenman <dg@FreeBSD.org>

Implemented a (sorely needed for years) double fault handler to catch stack
overflows.
It sure would be nice if there was an unmapped page between the PCB and
the stack (and that the size of the stack was configurable!). With the
way things are now, the PCB will get clobbered before the double fault
handler gets control, making somewhat of a mess of things. Despite this,
it is still fairly easy to poke around in the overflowed stack to figure
out the cause.


# b1529bda 14-Dec-1995 Peter Wemm <peter@FreeBSD.org>

GENERIC/LINT: Remove redundant quoting on some option lines.
LINT: add a couple of new/missing/undocumented options
files.i386: add linux code so that you can compile a kernel with static
linux emulation ("options LINUX")
i386/*: use #if defined(COMPAT_LINUX) || defined(LINUX) to enable static
support of linux emulation (just like "IBCS2" makes ibcs2 static)

The main thing this is going to make obvious, is that the LINUX code
(when compiled from LINT) has a lot of warnings, some of which dont look
too pleasant..


# 0e1815bb 07-Sep-1995 David Greenman <dg@FreeBSD.org>

Minor cleanup and (very) small micro optimization to Xsyscall (and the
linux one)..


# a722b904 15-Aug-1995 Bruce Evans <bde@FreeBSD.org>

Fake a call frame for traps so that `gdb -k' can report where fatal
traps occurred. This also helps ddb backtrace through trap frames.
Backtracing through syscall and interrupt frames still doesn't work
but it is relatively unimportant and more expensive to fix.


# d3628763 11-Jun-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Merge RELENG_2_0_5 into HEAD


# 1e1e0b44 14-Feb-1995 Søren Schmidt <sos@FreeBSD.org>

First attempt to run linux binaries. This is only the changes needed to
the generic kernel. The actual emulator is a separate LKM. (not finished
yet, sorry).
Submitted by: sos@freebsd.org & sef@kithrup.com


# 20415301 14-Jan-1995 Bruce Evans <bde@FreeBSD.org>

Fix security holes in sigreturn(), ptrace() and procfs. sigreturn()
attempted to check for insecure and fatal eflags and segment
selectors, but missed many cases and got the IOPL check back to
front. The other syscalls didn't check at all.

sys_process.c, machdep.c:
Only allow PT_WRITE_U to write to the registers (ordinary and FP).

psl.h, locore.s, machdep.c:
Eliminate PSL_MBZ, PSL_MBO and PSL_USERCLR. We are not supposed
to assume anything about the reserved bits. Use PSL_USERCHANGE
and PSL_KERNEL instead. Rename PSL_USERSET to PSL_USER.

exception.s:
Define a private label for use by doreti when returning to user
mode fails.

machdep.c:
In syscalls, allow changing only the eflags that can be changed on
486's in user mode (no longer attempt to allow benign IOPL changes;
allow changing the nasty PSL_NT; don't allow changing the i586
bits).

Don't attempt to check all the cases involving invalid selectors
and %eip's. Just check for privilege violations and let the invalid
things cause a trap.

procfs_machdep.c:
Call the ptrace register functions to do all the work for reading
and writing ordinary registers and for single stepping.

trap.c:
Ignore traps caused by PSL_NT being set. Previously, users could
cause a fatal trap in user mode by setting PSL_NT and executing an
iret, and a fatal trap in kernel mode by setting PSL_NT and making
a syscall. PSL_NT was cleared too late and not in enough modes to
fix the problem.

Make all traps in user mode (except T_NMI) nonfatal.

Recover from traps caused by attempting to load invalid user
registers in doreti by restarting the traps so that they appear to
occur in user mode.
---

Fix bogons that I noticed while fixing the above:

psl.h:
Fix some comments.

Uniformize idempotency ifdef.

exception.s, machdep.c:
Remove rsvd[0-14]. rsvd0 hasn't been reserved since the 486 came
out. Replace rsvd0 by `align'. rsvd[0-11] used wrong (magic
non-unique) trap numbers. Replace rsvd[1-14] by rsvd.

locore.s:
Enable alignment check flag on 486's and 586's.

machdep.c:
Use a better type for kstack[].

Use TFREGP() to find the registers.

Reformat ptrace functions from SEF to something closer to KNF.

procfs_machdep.c:
The wrong pointer to the registers got fixed as a side effect.

Implement reading and writing of FP registers.

/proc/*/*regs now work (only) for processes that are in memory.

Clean up comments.

trap.c, trap.h:
Remove unused trap types.


# b39b673d 03-Dec-1994 Bruce Evans <bde@FreeBSD.org>

i386/exception.s,
Keep track of interrupt nesting level. It is normally 0
for syscalls and traps, but is fudged to 1 for their exit
processing in case they metamorphose into an interrupt
handler.

i386/genassym.c;
Remove support for the obsolete pcb_iml and pcb_cmap2.

Add support for pcb_inl.

i386/swtch.s:
Fudge the interrupt nesting level across context switches and in
the idle loop so that the work for preemptive context switches
gets counted as interrupt time, the work for voluntary context
switches gets counted mostly as system time (the part when
curproc == 0 gets counted as interrupt time), and only truly idle
time gets counted as idle time.

Remove obsolete support (commented out and otherwise) for pcb_iml.

Load curpcb just before curproc instead of just after so that
curpcb is always valid if curproc is. A few more changes like
this may fix tracing through context switches.

Remove obsolete function swtch_to_inactive().

include/cpu.h:
Use the new interrupt nesting level variable to implement a
non-fake CLF_INTR() so that accounting for the interrupt state
works.

You can use top, iostat or (best) an up to date systat to see
interrupt overheads. I see the expected huge interrupt overheads
for ISA devices (on a 486DX/33, about 55% for an IDE drive
transferring 1250K/sec and the same for a WD8013EBT network card
transferring 1100K/sec). The huge interrupt overheads for serial
devices are unfortunately normally invisible.

include/pcb.h:
Remove the obsolete pcb_iml and pcb_cmap2. Replace them by
padding to preserve binary compatibility.

Use part of the new padding for pcb_inl.

isa/icu.s:
isa/vector.s:
Keep track of interrupt nesting level.


# e04520ce 27-Sep-1994 Bruce Evans <bde@FreeBSD.org>

Ensure normal selection and alignment of the text and data sections before
including files. vector.s sometimes left the data section misaligned
(depending on the configuration) so all the time-critical globals in icu.s
were sometimes misaligned.


# f540b106 12-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Change all #includes to follow the current Berkeley style. Some of these
``changes'' are actually not changes at all, but CVS sometimes has trouble
telling the difference.

This also includes support for second-directory compiles. This is not
quite complete yet, as `config' doesn't yet do the right thing. You can
still make it work trivially, however, by doing the following:

rm /sys/compile
mkdir /usr/obj/sys/compile
ln -s M-. /sys/compile
cd /sys/i386/conf
config MYKERNEL
cd ../../compile/MYKERNEL
ln -s /sys @
rm machine
ln -s @/i386/include machine
make depend
make


# d2306226 02-Apr-1994 David Greenman <dg@FreeBSD.org>

New interrupt code from Bruce Evans. In additional to Bruce's attached
list of changes, I've made the following additional changes:

1) i386/include/ipl.h renamed to spl.h as the name conflicts with the
file of the same name in i386/isa/ipl.h.
2) changed all use of *mask (i.e. netmask, biomask, ttymask, etc) to
*_imask (net_imask, etc).
3) changed vestige of splnet use in if_is to splimp.
4) got rid of "impmask" completely (Bruce had gotten rid of netmask),
and are now using net_imask instead.
5) dozens of minor cruft to glue in Bruce's changes.

These require changes I made to config(8) as well, and thus it must
be rebuilt.

-DG

from Bruce Evans:

sio:
o No diff is supplied. Remove the define of setsofttty(). I hope
that is enough.

*.s:
o i386/isa/debug.h no longer exists. The event counters became too
much trouble to maintain. All function call entry and exception
entry counters can be recovered by using profiling kernel (the new
profiling supports all entry points; however, it is too slow to
leave enabled all the time; it also). Only BDBTRAP() from debug.h
is now used. That is moved to exception.s. It might be worth
preserving SHOW_BITS() and calling it from _mcount() (if enabled).
o T_ASTFLT is now only set just before calling trap().
o All exception handlers set SWI_AST_MASK in cpl as soon as possible
after entry and arrange for _doreti to restore it atomically with
exiting. It is not possible to set it atomically with entering
the kernel, so it must be checked against the user mode bits in
the trap frame before committing to using it. There is no place
to store the old value of cpl for syscalls or traps, so there are
some complications restoring it.

Profiling stuff (mostly in *.s):
o Changes to kern/subr_mcount.c, gcc and gprof are not supplied yet.
o All interesting labels `foo' are renamed `_foo' and all
uninteresting labels `_bar' are renamed `bar'. A small change
to gprof allows ignoring labels not starting with underscores.
o MCOUNT_LABEL() is to provide names for counters for times spent
in exception handlers.
o FAKE_MCOUNT() is a version of MCOUNT() suitable for exception
handlers. Its arg is the pc where the exception occurred. The
new mcount() pretends that this was a call from that pc to a
suitable MCOUNT_LABEL().
o MEXITCOUNT is to turn off any timer started by MCOUNT().

/usr/src/sys/i386/i386/exception.s:
o The non-BDB BPTTRAP() macros were doing a sti even when interrupts
were disabled when the trap occurred. The sti (fixed) sti is
actually a no-op unless you have my changes to machdep.c that make
the debugger trap gates interrupt gates, but fixing that would
make the ifdefs messier. ddb seems to be unharmed by both
interrupts always disabled and always enabled (I had the branch in
the fix back to front for some time :-().
o There is no known pushal bug.
o tf_err can be left as garbage for syscalls.

/usr/src/sys/i386/i386/locore.s:
o Fix and update BDE_DEBUGGER support.
o ENTRY(btext) before initialization was dangerous.
o Warm boot shot was longer than intended.

/usr/src/sys/i386/i386/machdep.c:
o DON'T APPLY ALL OF THIS DIFF. It's what I'm using, but may require
other changes.
Use the following:
o Remove aston() and setsoftclock().
Maybe use the following:
o No netisr.h.
o Spelling fix.
o Delay to read the Rebooting message.
o Fix for vm system unmapping a reduced area of memory
after bounds_check_with_label() reduces the size of
a physical i/o for a partition boundary. A similar
fix is required in kern_physio.c.
o Correct use of __CONCAT. It never worked here for non-
ANSI cpp's. Is it time to drop support for non-ANSI?
o gdt_segs init. 0xffffffffUL is bogus because ssd_limit
is not 32 bits. The replacement may have the same
value :-), but is more natural.
o physmem was one page too low. Confusing variable names.
Don't use the following:
o Better numbers of buffers. Each 8K page requires up to
16 buffer headers. On my system, this results in 5576
buffers containing [up to] 2854912 bytes of memory.
The usual allocation of about 384 buffers only holds
192K of disk if you use it on an fs with a block size
of 512.
o gdt changes for bdb.
o *TGT -> *IDT changes for bdb.
o #ifdefed changes for bdb.

/usr/src/sys/i386/i386/microtime.s:
o Use the correct asm macros. I think asm.h was copied from Mach
just for microtime and isn't used now. It certainly doesn't
belong in <sys>. Various macros are also duplicated in
sys/i386/boot.h and libc/i386/*.h.
o Don't switch to and from the IRR; it is guaranteed to be selected
(default after ICU init and explicitly selected in isa.c too, and
never changed until the old microtime clobbered it).

/usr/src/sys/i386/i386/support.s:
o Non-essential changes (none related to spls or profiling).
o Removed slow loads of %gs again. The LDT support may require
not relying on %gs, but loading it is not the way to fix it!
Some places (copyin ...) forgot to load it. Loading it clobbers
the user %gs. trap() still loads it after certain types of
faults so that fuword() etc can rely on it without loading it
explicitly. Exception handlers don't restore it. If we want
to preserve the user %gs, then the fastest method is to not
touch it except for context switches. Comparing with
VM_MAXUSER_ADDRESS and branching takes only 2 or 4 cycles on
a 486, while loading %gs takes 9 cycles and using it takes
another.
o Fixed a signed branch to unsigned.

/usr/src/sys/i386/i386/swtch.s:
o Move spl0() outside of idle loop.
o Remove cli/sti from idle loop. sw1 does a cli, and in the
unlikely event of an interrupt occurring and whichqs becoming
zero, sw1 will just jump back to _idle.
o There's no spl0() function in asm any more, so use splz().
o swtch() doesn't need to be superaligned, at least with the
new mcounting.
o Fixed a signed branch to unsigned.
o Removed astoff().

/usr/src/sys/i386/i386/trap.c:
o The decentralized extern decls were inconsistent, of course.
o Fixed typo MATH_EMULTATE in comments. */
o Removed unused variables.
o Old netmask is now impmask; print it instead. Perhaps we
should print some of the new masks.
o BTW, trap() should not print anything for normal debugger
traps.

/usr/src/sys/i386/include/asmacros.h:
o DON'T APPLY ALL OF THIS DIFF. Just use some of the null macros
as necessary.

/usr/src/sys/i386/include/cpu.h:
o CLKF_BASEPRI() changes since cpl == SWI_AST_MASK is now normal
while the kernel is running.
o Don't use var++ to set boolean variables. It fails after a mere
4G times :-) and is slower than storing a constant on [3-4]86s.

/usr/src/sys/i386/include/cpufunc.h:
o DON'T APPLY ALL OF THIS DIFF. You need mainly the include of
<machine/ipl.h>. Unfortunately, <machine/ipl.h> is needed by
almost everything for the inlines.

/usr/src/sys/i386/include/ipl.h:
o New file. Defines spl inlines and SWI macros and declares most
variables related to hard and soft interrupt masks.

/usr/src/sys/i386/isa/icu.h:
o Moved definitions to <machine/ipl.h>

/usr/src/sys/i386/isa/icu.s:
o Software interrupts (SWIs) and delayed hardware interrupts (HWIs)
are now handled uniformally, and dispatching them from splx() is
more like dispatching them from _doreti. The dispatcher is
essentially *(handler[ffs(ipending & ~cpl)]().
o More care (not quite enough) is taken to avoid unbounded nesting
of interrupts.
o The interface to softclock() is changed so that a trap frame is
not required.
o Fast interrupt handlers are now handled more uniformally.
Configuration is still too early (new handlers would require
bits in <machine/ipl.h> and functions to vector.s).
o splnnn() and splx() are no longer here; they are inline functions
(could be macros for other compilers). splz() is the nontrivial
part of the old splx().

/usr/src/sys/i386/isa/ipl.h
o New file. Supposed to have only bus-dependent stuff. Perhaps
the h/w masks should be declared here.

/usr/src/sys/i386/isa/isa.c:
o DON'T APPLY ALL OF THIS DIFF. You need only things involving
*mask and *MASK and comments about them. netmask is now a pure
software mask. It works like the softclock mask.

/usr/src/sys/i386/isa/vector.s:
o Reorganize AUTO_EOI* macros.
o Option FAST_INTR_HANDLER_USERS_ES for people who don't trust
fastintr handlers.
o fastintr handlers need to metamorphose into ordinary interrupt
handlers if their SWI bit has become set. Previously, sio had
unintended latency for handling output completions and input
of SLIP framing characters because this was not done.

/usr/src/sys/net/netisr.h:
o The machine-dependent stuff is now imported from <machine/ipl.h>.

/usr/src/sys/sys/systm.h
o DON'T APPLY ALL OF THIS DIFF. You need mainly the different
splx() prototype. The spl*() prototypes are duplicated as
inlines in <machine/ipl.h> but they need to be duplicated here
in case there are no inlines. I sent systm.h and cpufunc.h
to Garrett. We agree that spl0 should be replaced by splnone
and not the other way around like I've done.

/usr/src/sys/kern/kern_clock.c
o splsoftclock() now lowers cpl so the direct call to softclock()
works as intended.
o softclock() interface changed to avoid passing the whole frame
(some machines may need another change for profile_tick()).
o profiling renamed _profiling to avoid ANSI namespace pollution.
(I had to improve the mcount() interface and may as well fix it.)
The GUPROF variant doesn't actually reference profiling here,
but the 'U' in GUPROF should mean to select the microtimer
mcount() and not change the interface.


# c8a13ecd 03-Jan-1994 David Greenman <dg@FreeBSD.org>

Convert syscall to trapframe. Based on work done by John Brezak.


# 0967373e 12-Nov-1993 David Greenman <dg@FreeBSD.org>

First steps in rewriting locore.s, and making info useful
when the machine panics.

i386/i386/locore.s:
1) got rid of most .set directives that were being used like
#define's, and replaced them with appropriate #define's in
the appropriate header files (accessed via genassym).
2) added comments to header inclusions and global definitions,
and global variables
3) replaced some hardcoded constants with cpp defines (such as
PDESIZE and others)
4) aligned all comments to the same column to make them easier to
read
5) moved macro definitions for ENTRY, ALIGN, NOP, etc. to
/sys/i386/include/asmacros.h
6) added #ifdef BDE_DEBUGGER around all of Bruce's debugger code
7) added new global '_KERNend' to store last location+1 of kernel
8) cleaned up zeroing of bss so that only bss is zeroed
9) fix zeroing of page tables so that it really does zero them all
- not just if they follow the bss.
10) rewrote page table initialization code so that 1) works correctly
and 2) write protects the kernel text by default
11) properly initialize the kernel page directory, upages, p0stack PT,
and page tables. The previous scheme was more than a bit
screwy.
12) change allocation of virtual area of IO hole so that it is
fixed at KERNBASE + 0xa0000. The previous scheme put it
right after the kernel page tables and then later expected
it to be at KERNBASE +0xa0000
13) change multiple bogus settings of user read/write of various
areas of kernel VM - including the IO hole; we should never
be accessing the IO hole in user mode through the kernel
page tables
14) split kernel support routines such as bcopy, bzero, copyin,
copyout, etc. into a seperate file 'support.s'
15) split swtch and related routines into a seperate 'swtch.s'
16) split routines related to traps, syscalls, and interrupts
into a seperate file 'exception.s'
17) remove some unused global variables from locore that got
inserted by Garrett when he pulled them out of some .h
files.

i386/isa/icu.s:
1) clean up global variable declarations
2) move in declaration of astpending and netisr

i386/i386/pmap.c:
1) fix calculation of virtual_avail. It previously was calculated
to be right in the middle of the kernel page tables - not
a good place to start allocating kernel VM.
2) properly allocate kernel page dir/tables etc out of kernel map
- previously only took out 2 pages.

i386/i386/machdep.c:
1) modify boot() to print a warning that the system will reboot in
PANIC_REBOOT_WAIT_TIME amount of seconds, and let the user
abort with a key on the console. The machine will wait for
ever if a key is typed before the reboot. The default is
15 seconds, but can be set to 0 to mean don't wait at all,
-1 to mean wait forever, or any positive value to wait for
that many seconds.
2) print "Rebooting..." just before doing it.

kern/subr_prf.c:
1) remove PANICWAIT as it is deprecated by the change to machdep.c

i386/i386/trap.c:
1) add table of trap type strings and use it to print a real trap/
panic message rather than just a number. Lot's of work to
be done here, but this is the first step. Symbolic traceback
is in the TODO.

i386/i386/Makefile.i386:
1) add support in to build support.s, exception.s and swtch.s

...and various changes to various header files to make all of the
above happen.