History log of /netbsd-current/sys/arch/x86/x86/fpu.c
Revision Date Author Comments
# 1.89 21-Jun-2024 riastradh

x86/fpu.c: Nix trailing whitespace.

No functional change intended.


# 1.88 17-May-2024 manu

Workaround panic: fpudna from userland

i386 Xen PV domU gets spurious fpudna traps from userland. Older eager FPU
context switching code took care of ignoring them. When transitioning
from eager switching to always switching, this special handling was
removed, causing "fpudna from userland" panics.

This change restores the previous behavior where fpudna traps from
userland are ignored on Xen PV domU.


Revision tags: thorpej-ifq-base thorpej-altq-separation-base
# 1.87 18-Jul-2023 riastradh

x86/fpu: In kernel mode fpu traps, print the instruction pointer.


# 1.86 03-Mar-2023 riastradh

x86/fpu: Align savefpu to 64 bytes in fpuinit_mxcsr_mask.

16 bytes is not enough.

(Is this why it never worked on Xen some years back? Got lucky and
accidentally had 64-byte alignment on native x86, but not in the call
stack in Xen?)
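
The alignment requirement behind this change can be illustrated in userland. The sketch below is not the kernel code: `struct savearea_sketch`, `is_aligned` and `stack_savearea_is_64_aligned` are hypothetical names, and the 512-byte size is simply the size of the legacy FXSAVE region. It relies on the documented hardware rule that FXSAVE needs a 16-byte-aligned operand while XSAVE needs a 64-byte-aligned one.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a kernel FPU save area.  512 bytes is the
 * size of the legacy FXSAVE region; the real structure is larger.
 */
struct savearea_sketch {
	alignas(64) unsigned char bytes[512];
};

/* Returns nonzero if p is aligned to the given power-of-two boundary. */
static int
is_aligned(const void *p, size_t boundary)
{
	return ((uintptr_t)p & (boundary - 1)) == 0;
}

/*
 * Places a save area on the stack and reports whether it ended up on a
 * 64-byte boundary, as requested by the alignas() on the member.
 */
static int
stack_savearea_is_64_aligned(void)
{
	struct savearea_sketch area;

	area.bytes[0] = 0;
	return is_aligned(&area, 64);
}
```

An address such as 72 passes an 8-byte check but fails both the 16-byte and 64-byte checks, which is the kind of mismatch the `rdi % 16 = 8` observation in revision 1.42 describes.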

XXX pullup-10


# 1.85 03-Mar-2023 riastradh

Revert "x86: Add kthread_fpu_enter/exit support, take two."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.84 03-Mar-2023 riastradh

Revert "x86/fpu.c: Sprinkle KNF."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.83 25-Feb-2023 riastradh

x86/fpu.c: Sprinkle KNF.

No functional change intended.


# 1.82 25-Feb-2023 riastradh

x86: Add kthread_fpu_enter/exit support, take two.

This time, make sure to restore the FPU state when switching to a
kthread in the middle of kthread_fpu_enter/exit.

This adds a single predicted-taken branch for the case of kthreads
that are not in kthread_fpu_enter/exit, so it incurs a penalty only
for threads that actually use it. Since it avoids FPU state
switching in kthreads that do use the FPU, namely cgd worker threads,
this should be a net performance win on systems using it and have
negligible impact otherwise.

XXX pullup-10


# 1.81 25-Feb-2023 riastradh

x86: Label boolean is_64bit argument to fpu_area_restore.

No functional change intended.


# 1.80 25-Feb-2023 riastradh

x86: Mitigate MXCSR Configuration Dependent Timing in kernel FPU use.

In fpu_kern_enter, make sure all the MXCSR exception status bits are
set when we start using the FPU, so that instructions which exhibit
MCDT are unaffected by it.

While here, zero all the other FPU registers in fpu_kern_enter.

In principle we could skip this step on future CPUs that fix the MCDT
bug, but there's probably not much benefit -- workloads that do a lot
of crypto in the kernel are probably better off using
kthread_fpu_enter or WQ_FPU to skip the fpu_kern_enter/leave cycles
in the first place.

For details, see:
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/mxcsr-configuration-dependent-timing.html


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.79 20-Aug-2022 riastradh

branches: 1.79.4;
fpu_kern_enter/leave: Disable IPL assertions.

These don't work because mutex_enter/exit on a spin lock may raise an
IPL but not lower it, if another spin lock was already held. For
example,

mutex_enter(some_lock_at_IPL_VM);
printf("foo\n");
fpu_kern_enter();
...
fpu_kern_leave();
mutex_exit(some_lock_at_IPL_VM);

will trigger the panic, because printf takes a lock at IPL_HIGH where
the IPL will remain until the mutex_exit. (This was a nightmare to
track down before I remembered that detail of spin lock IPL
semantics...)
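
The spin lock behaviour described above can be modelled in a few lines of userland C. This is a simplified sketch, not the kernel implementation: `struct cpu_sketch`, `spin_enter`, `spin_exit` and the numeric IPL values are illustrative assumptions. The model only captures the one rule that matters here, namely that the IPL is restored when the outermost spin lock is released, not when an inner one is.

```c
/* Simplified priority levels; the numeric values are illustrative. */
enum { IPL_NONE = 0, IPL_VM = 6, IPL_SCHED = 7, IPL_HIGH = 8 };

/* Hypothetical per-CPU bookkeeping for spin locks. */
struct cpu_sketch {
	int ipl;	/* current interrupt priority level */
	int spin_held;	/* number of spin locks currently held */
	int saved_ipl;	/* IPL to restore when the last spin lock drops */
};

/* Taking a spin lock raises the IPL, and remembers the old IPL only
 * if this is the first spin lock held on this CPU. */
static void
spin_enter(struct cpu_sketch *ci, int lock_ipl)
{
	int old = ci->ipl;

	if (lock_ipl > ci->ipl)
		ci->ipl = lock_ipl;
	if (ci->spin_held++ == 0)
		ci->saved_ipl = old;
}

/* Releasing a spin lock lowers the IPL only when no spin lock remains. */
static void
spin_exit(struct cpu_sketch *ci)
{
	if (--ci->spin_held == 0)
		ci->ipl = ci->saved_ipl;
}

/* Replays the sequence from the log message up to fpu_kern_enter()
 * and returns the IPL in effect at that point. */
static int
ipl_at_fpu_kern_enter(void)
{
	struct cpu_sketch ci = { IPL_NONE, 0, IPL_NONE };

	spin_enter(&ci, IPL_VM);	/* some_lock_at_IPL_VM */
	spin_enter(&ci, IPL_HIGH);	/* lock taken inside printf */
	spin_exit(&ci);			/* printf returns; IPL stays raised */
	return ci.ipl;
}
```

In this model the caller is still at the raised level when it reaches fpu_kern_enter, even though it only ever asked for IPL_VM, so an assertion that the IPL is at most IPL_VM would fire.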


# 1.78 24-May-2022 andvar

fix various typos in comments, docs and log messages.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
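
The two layouts can be sketched as follows. The union below only mirrors the field widths given in the message above; the names `fp_addr_sketch`, `fa_off`, `fa_seg`, `fa_rsvd`, `legacy_offset` and `fits_legacy` are assumptions for illustration and are not taken from the NetBSD headers.

```c
#include <stdint.h>

/* Hypothetical mirror of the two FIP/FDP layouts described above. */
union fp_addr_sketch {
	uint64_t fa_64;			/* FXSAVE64: full 64-bit address */
	struct {
		uint32_t fa_off;	/* legacy: 32-bit offset */
		uint16_t fa_seg;	/* legacy: 16-bit segment */
		uint16_t fa_rsvd;	/* legacy: 16-bit reserved field */
	} fa_32;
};

/* The part of a 64-bit address that the legacy offset field can hold. */
static uint32_t
legacy_offset(uint64_t addr)
{
	return (uint32_t)(addr & 0xffffffffu);
}

/* Nonzero if the address survives the legacy layout unchanged. */
static int
fits_legacy(uint64_t addr)
{
	return (uint64_t)legacy_offset(addr) == addr;
}
```

A typical 32-bit text address such as 0x08048000 fits, while a 64-bit address such as 0x00007f7fdeadbeef is cut down to 0xdeadbeef, which is why the data reported by the old variant was not meaningful for 64-bit programs.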


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
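
A minimal sketch of the masking logic, assuming the documented x86 convention that an MXCSR_MASK field of zero in the FXSAVE image means the default mask 0x0000FFBF applies. The function and macro names are hypothetical and do not come from fpu.c. DAZ is bit 6 of MXCSR, which is exactly the bit the default mask leaves out.

```c
#include <stdint.h>

#define MXCSR_DAZ		0x00000040u	/* denormals-are-zero, bit 6 */
#define MXCSR_MASK_DEFAULT	0x0000ffbfu	/* used when the CPU reports 0 */

/* Picks the mask to apply: the value read from the FXSAVE image, or
 * the architectural default if that field is zero. */
static uint32_t
mxcsr_effective_mask(uint32_t reported)
{
	return reported != 0 ? reported : MXCSR_MASK_DEFAULT;
}

/* Clears any bits a restore instruction would reject as reserved. */
static uint32_t
mxcsr_sanitize(uint32_t mxcsr, uint32_t mask)
{
	return mxcsr & mask;
}
```

With a detected mask of 0xFFFF, a user-supplied value of 0xFFFF1FC0 is reduced to 0x1FC0 and DAZ is kept. With the static default mask, the same value becomes 0x1F80 and DAZ is silently dropped.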


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.88 17-May-2024 manu

iWorkaround panic: fpudna from userland

i386 Xen PV domU get spurious fpudna traps from userland. Older eager FPU
contact switching code took care of ignoring them. When transitioning
from eager switching to awlays switching, this special handling was
removed, causing "fpudna from userland" panics.

This change restores the previosu behavior where fpudna traps from
userland are ignored on Xen PV domU.


Revision tags: thorpej-ifq-base thorpej-altq-separation-base
# 1.87 18-Jul-2023 riastradh

x86/fpu: In kernel mode fpu traps, print the instruction pointer.


# 1.86 03-Mar-2023 riastradh

x86/fpu: Align savefpu to 64 bytes in fpuinit_mxcsr_mask.

16 bytes is not enough.

(Is this why it never worked on Xen some years back? Got lucky and
accidentally had 64-byte alignment on native x86, but not in the call
stack in Xen?)

XXX pullup-10


# 1.85 03-Mar-2023 riastradh

Revert "x86: Add kthread_fpu_enter/exit support, take two."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.84 03-Mar-2023 riastradh

Revert "x86/fpu.c: Sprinkle KNF."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.83 25-Feb-2023 riastradh

x86/fpu.c: Sprinkle KNF.

No functional change intended.


# 1.82 25-Feb-2023 riastradh

x86: Add kthread_fpu_enter/exit support, take two.

This time, make sure to restore the FPU state when switching to a
kthread in the middle of kthread_fpu_enter/exit.

This adds a single predicted-taken branch for the case of kthreads
that are not in kthread_fpu_enter/exit, so it incurs a penalty only
for threads that actually use it. Since it avoids FPU state
switching in kthreads that do use the FPU, namely cgd worker threads,
this should be a net performance win on systems using it and have
negligible impact otherwise.

XXX pullup-10


# 1.81 25-Feb-2023 riastradh

x86: Label boolean is_64bit argument to fpu_area_restore.

No functional change intended.


# 1.80 25-Feb-2023 riastradh

x86: Mitigate MXCSR Configuration Dependent Timing in kernel FPU use.

In fpu_kern_enter, make sure all the MXCSR exception status bits are
set when we start using the FPU, so that instructions which exhibit
MCDT are unaffected by it.

While here, zero all the other FPU registers in fpu_kern_enter.

In principle we could skip this step on future CPUs that fix the MCDT
bug, but there's probably not much benefit -- workloads that do a lot
of crypto in the kernel are probably better off using
kthread_fpu_enter or WQ_FPU to skip the fpu_kern_enter/leave cycles
in the first place.

For details, see:
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/mxcsr-configuration-dependent-timing.html


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.79 20-Aug-2022 riastradh

branches: 1.79.4;
fpu_kern_enter/leave: Disable IPL assertions.

These don't work because mutex_enter/exit on a spin lock may raise an
IPL but not lower it, if another spin lock was already held. For
example,

mutex_enter(some_lock_at_IPL_VM);
printf("foo\n");
fpu_kern_enter();
...
fpu_kern_leave();
mutex_exit(some_lock_at_IPL_VM);

will trigger the panic, because printf takes a lock at IPL_HIGH where
the IPL wil remain until the mutex_exit. (This was a nightmare to
track down before I remembered that detail of spin lock IPL
semantics...)


# 1.78 24-May-2022 andvar

fix various typos in comments, docs and log messages.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
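
The two layouts can be sketched as a C union. The names fa_64 and fa_32
come from the message above; the union name toy_fp_addr, the member names
inside fa_32 and the helper function are hypothetical, chosen only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the two views of an FIP/FDP slot in the x87 area. */
union toy_fp_addr {
	uint64_t fa_64;			/* FXSAVE64: full 64-bit pointer */
	struct {
		uint32_t fa_off;	/* legacy FXSAVE: 32-bit offset */
		uint16_t fa_seg;	/* 16-bit segment */
		uint16_t fa_rsvd;	/* 16-bit reserved field */
	} fa_32;
};

static_assert(sizeof(union toy_fp_addr) == 8, "both views occupy 8 bytes");

/* What a consumer of the legacy layout can recover from a 64-bit address. */
static uint64_t
toy_legacy_visible_addr(uint64_t real_addr)
{
	union toy_fp_addr a;

	a.fa_32.fa_off = (uint32_t)real_addr;	/* upper 32 bits are lost */
	a.fa_32.fa_seg = 0;
	a.fa_32.fa_rsvd = 0;
	return a.fa_32.fa_off;
}
```

For an address such as 0x00007f7fdeadbeef the legacy view keeps only
0xdeadbeef, which is why the old data were not meaningful for 64-bit
programs.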


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

	fpu_kern_enter();
	/* do FPU stuff */
	fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

	splhigh
	unbusy the current cpu's fpu
	create a new fpu state in memory
	install the state on the current cpu's fpu
	splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
  That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
  new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
  will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
  this XRSTOR XSTATE_BV still contains the initial values, and it forces
  a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
  sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
  given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
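
The XSTATE_BV rule in the list above can be sketched as follows. The
struct and function names are hypothetical. The bit positions (x87 = bit
0, SSE = bit 1, AVX = bit 2) and the 64-byte header whose first 8 bytes
hold XSTATE_BV follow the architectural XSAVE layout.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_XCR0_X87	(UINT64_C(1) << 0)
#define TOY_XCR0_SSE	(UINT64_C(1) << 1)
#define TOY_XCR0_YMM	(UINT64_C(1) << 2)

/* Hypothetical view of the 64-byte XSAVE header. */
struct toy_xsave_header {
	uint64_t xstate_bv;	/* which components are valid in memory */
	uint64_t xcomp_bv;
	uint8_t  rsvd[48];
};

static_assert(sizeof(struct toy_xsave_header) == 64, "header is 64 bytes");

/*
 * Software that edits a component in the in-memory area marks it in
 * XSTATE_BV so that the next XRSTOR loads it instead of resetting it
 * to its initial configuration.
 */
static void
toy_mark_component_modified(struct toy_xsave_header *hdr, uint64_t component)
{
	hdr->xstate_bv |= component;
}
```

For example, after writing a non-default x87 control word into the
legacy region, a caller in this model would pass TOY_XCR0_X87.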

Small benchmark:
	switch lwp to cpu2
	do float operation
	switch lwp to cpu3
	do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
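
How a static mask loses DAZ can be sketched with a little arithmetic.
The macro and function names are hypothetical. The values rely on two
architectural facts: DAZ is MXCSR bit 6, and a CPU that reports an
MXCSR_MASK of zero in its FXSAVE image is to be treated as having the
default mask 0xFFBF, which excludes DAZ.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MXCSR_DAZ		0x0040u	/* denormals-are-zero, bit 6 */
#define TOY_MXCSR_DEFAULT_MASK	0xffbfu	/* used when the CPU reports 0 */

/* Mask to apply, given the MXCSR_MASK field read from an FXSAVE image. */
static uint32_t
toy_mxcsr_mask(uint32_t reported)
{
	return reported != 0 ? reported : TOY_MXCSR_DEFAULT_MASK;
}

/* Clear bits the CPU does not support so a later restore cannot fault. */
static uint32_t
toy_mxcsr_sanitize(uint32_t mxcsr, uint32_t mask)
{
	return mxcsr & mask;
}
```

With the default mask, a user value of 0x1fc0 (the reset value 0x1f80
plus DAZ) is reduced to 0x1f80. With a detected mask of 0xffff the DAZ
bit survives.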


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
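
The reported values are consistent with the standard XSAVE layout, as
the following sketch shows. The function and macro names are
hypothetical, and the kernel obtains these numbers from CPUID rather
than computing them this way. The sizes used are architectural: a
512-byte legacy region, a 64-byte header, and 256 bytes for the upper
halves of the sixteen ymm registers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_XSAVE_X87	(UINT64_C(1) << 0)
#define TOY_XSAVE_SSE	(UINT64_C(1) << 1)
#define TOY_XSAVE_AVX	(UINT64_C(1) << 2)

/* Size of a standard-format XSAVE area covering x87, SSE and AVX only. */
static unsigned
toy_xsave_size(uint64_t features)
{
	unsigned size = 512 + 64;	/* legacy region + XSAVE header */

	if (features & TOY_XSAVE_AVX)
		size += 16 * 16;	/* upper 128 bits of ymm0..ymm15 */
	return size;
}
```

Here xsave_features = 7 means x87, SSE and AVX are all present, and
512 + 64 + 256 gives the 832 bytes shown above.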


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.87 18-Jul-2023 riastradh

x86/fpu: In kernel mode fpu traps, print the instruction pointer.


# 1.86 03-Mar-2023 riastradh

x86/fpu: Align savefpu to 64 bytes in fpuinit_mxcsr_mask.

16 bytes is not enough.

(Is this why it never worked on Xen some years back? Got lucky and
accidentally had 64-byte alignment on native x86, but not in the call
stack in Xen?)

XXX pullup-10


# 1.85 03-Mar-2023 riastradh

Revert "x86: Add kthread_fpu_enter/exit support, take two."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.84 03-Mar-2023 riastradh

Revert "x86/fpu.c: Sprinkle KNF."

kthread_fpu_enter/exit changes broke some hardware, unclear why, to
investigate before fixing and reapplying these changes.


# 1.83 25-Feb-2023 riastradh

x86/fpu.c: Sprinkle KNF.

No functional change intended.


# 1.82 25-Feb-2023 riastradh

x86: Add kthread_fpu_enter/exit support, take two.

This time, make sure to restore the FPU state when switching to a
kthread in the middle of kthread_fpu_enter/exit.

This adds a single predicted-taken branch for the case of kthreads
that are not in kthread_fpu_enter/exit, so it incurs a penalty only
for threads that actually use it. Since it avoids FPU state
switching in kthreads that do use the FPU, namely cgd worker threads,
this should be a net performance win on systems using it and have
negligible impact otherwise.

XXX pullup-10


# 1.81 25-Feb-2023 riastradh

x86: Label boolean is_64bit argument to fpu_area_restore.

No functional change intended.


# 1.80 25-Feb-2023 riastradh

x86: Mitigate MXCSR Configuration Dependent Timing in kernel FPU use.

In fpu_kern_enter, make sure all the MXCSR exception status bits are
set when we start using the FPU, so that instructions which exhibit
MCDT are unaffected by it.

While here, zero all the other FPU registers in fpu_kern_enter.

In principle we could skip this step on future CPUs that fix the MCDT
bug, but there's probably not much benefit -- workloads that do a lot
of crypto in the kernel are probably better off using
kthread_fpu_enter or WQ_FPU to skip the fpu_kern_enter/leave cycles
in the first place.

For details, see:
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/mxcsr-configuration-dependent-timing.html


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.79 20-Aug-2022 riastradh

fpu_kern_enter/leave: Disable IPL assertions.

These don't work because mutex_enter/exit on a spin lock may raise an
IPL but not lower it, if another spin lock was already held. For
example,

mutex_enter(some_lock_at_IPL_VM);
printf("foo\n");
fpu_kern_enter();
...
fpu_kern_leave();
mutex_exit(some_lock_at_IPL_VM);

will trigger the panic, because printf takes a lock at IPL_HIGH where
the IPL wil remain until the mutex_exit. (This was a nightmare to
track down before I remembered that detail of spin lock IPL
semantics...)


# 1.78 24-May-2022 andvar

fix various typos in comments, docs and log messages.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
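
The size-carrying iovec convention described above can be illustrated
with a userland sketch. This is not the kernel's code: copy_state_out()
is a hypothetical stand-in for the copy a PT_GETXSTATE-style request
might perform, and it assumes the copy is simply truncated to the
caller's buffer length.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: copy at most iov->iov_len bytes of a state
 * structure of ksize bytes into the caller's buffer, and return the
 * number of bytes actually copied.
 */
static size_t
copy_state_out(const void *kstate, size_t ksize, const struct iovec *iov)
{
	size_t n = iov->iov_len < ksize ? iov->iov_len : ksize;

	memcpy(iov->iov_base, kstate, n);
	return n;
}
```

A caller built against an older, smaller structure passes a smaller
iov_len and receives only the leading fields it knows about, which is
the partial-read behaviour the message refers to.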


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
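
The alignment failure described above (rdi % 16 = 8) can be reproduced
as a plain address check. FXSAVE requires a 16-byte-aligned memory
operand; is_aligned() below is a hypothetical helper for illustration,
not something taken from fpu.c.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if p is a multiple of align bytes. */
static int
is_aligned(const void *p, uintptr_t align)
{
	return ((uintptr_t)p % align) == 0;
}
```

A buffer declared with _Alignas(16) passes the check, while the same
buffer offset by 8 bytes leaves the remainder of 8 reported in the
message.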


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be a
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

	splhigh
	unbusy the current cpu's fpu
	create a new fpu state in memory
	install the state on the current cpu's fpu
	splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
	switch lwp to cpu2
	do float operation
	switch lwp to cpu3
	do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.
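
The rule that software sets XSTATE_BV[i]=1 before changing a component
in memory can be sketched as below. The XSAVE header begins at byte 512
of the area and its first 8 bytes are XSTATE_BV. The helper names are
hypothetical, and the sketch works on an ordinary byte buffer rather
than a live save area.

```c
#include <stdint.h>
#include <string.h>

#define XSAVE_HDR_OFFSET 512	/* legacy FXSAVE region comes first */

/* Hypothetical helper: read XSTATE_BV from an in-memory XSAVE area. */
static uint64_t
xstate_bv_get(const uint8_t *area)
{
	uint64_t bv;

	memcpy(&bv, area + XSAVE_HDR_OFFSET, sizeof(bv));
	return bv;
}

/* Hypothetical helper: mark component i as present so XRSTOR loads it. */
static void
xstate_bv_set(uint8_t *area, unsigned i)
{
	uint64_t bv = xstate_bv_get(area) | ((uint64_t)1 << i);

	memcpy(area + XSAVE_HDR_OFFSET, &bv, sizeof(bv));
}
```

With a zeroed area, setting bit 1 (SSE) and then bit 0 (x87) yields an
XSTATE_BV of 3, so both components would be reloaded from memory.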


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
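
Why a fixed mask loses DAZ can be shown with a little arithmetic. The
architectural default mask, used when FXSAVE reports an MXCSR_MASK of 0,
is 0x0000FFBF, which has bit 6 (DAZ) clear. The helpers below are
hypothetical illustrations, not the functions in fpu.c.

```c
#include <stdint.h>

#define DEFAULT_MXCSR_MASK	0x0000FFBFu	/* used if FXSAVE reports 0 */
#define MXCSR_DAZ		0x00000040u	/* denormals-are-zero, bit 6 */

/* Hypothetical helper: choose the mask from the value FXSAVE stored. */
static uint32_t
effective_mxcsr_mask(uint32_t fxsave_mask)
{
	return fxsave_mask != 0 ? fxsave_mask : DEFAULT_MXCSR_MASK;
}

/* Hypothetical helper: drop unsupported bits from a user-supplied MXCSR. */
static uint32_t
sanitize_mxcsr(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

Masking 0x1FC0 (default MXCSR plus DAZ) with the default mask gives
0x1F80, silently dropping DAZ, whereas a detected mask of 0x0000FFFF
keeps it.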


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present; x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
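
The sysctl values quoted above can be decoded by hand: xsave_features = 7
is the x87, SSE and AVX bits of XCR0, and 832 is the 512-byte legacy
region plus the 64-byte header plus 256 bytes for the upper halves of
sixteen YMM registers. The sketch below models only those three
components. The constant and function names are hypothetical, and on
real hardware the size is reported by CPUID leaf 0xD.

```c
#include <stdint.h>

#define FEAT_X87	0x1u	/* XCR0 bit 0 */
#define FEAT_SSE	0x2u	/* XCR0 bit 1 */
#define FEAT_AVX	0x4u	/* XCR0 bit 2 */

/* Size of the standard-format XSAVE area for x87/SSE/AVX only. */
static unsigned
xsave_size_upto_avx(uint64_t features)
{
	unsigned size = 512 + 64;	/* legacy region + XSAVE header */

	if (features & FEAT_AVX)
		size += 16 * 16;	/* high 128 bits of 16 YMM registers */
	return size;
}
```

For a feature mask of 7 this gives 832, matching the reported
machdep.xsave_size. Without the AVX bit it gives 576.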


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.




Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
  That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
  new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
  will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
  this XRSTOR XSTATE_BV still contains the initial values, and it forces
  a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
  sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
  given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
    switch lwp to cpu2
    do float operation
    switch lwp to cpu3
    do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.
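
The interplay between XINUSE, XSTATE_BV and the save instructions described
above can be illustrated with a toy model in plain C. This is only a sketch
under stated assumptions: the toy_* names are hypothetical and do not appear
in fpu.c, only the init optimization of XSAVEOPT is modelled (the modified
optimization is left out), and the 'writes' field merely counts how many
state components would be stored to memory.

```c
#include <stdint.h>

/* Toy CPU: one XINUSE bit per state component (bit 0 x87, bit 1 SSE, bit 2 AVX). */
struct toy_cpu {
	uint64_t xinuse;
};

/* Toy save area: the XSTATE_BV header field plus a count of component writes. */
struct toy_area {
	uint64_t xstate_bv;
	unsigned writes;
};

/* Both save variants record XINUSE in XSTATE_BV for the requested components. */
static void
toy_update_bv(const struct toy_cpu *cpu, struct toy_area *area, uint64_t rfbm)
{
	area->xstate_bv = (area->xstate_bv & ~rfbm) | (cpu->xinuse & rfbm);
}

/* XSAVE: every requested component is written to memory. */
static void
toy_xsave(const struct toy_cpu *cpu, struct toy_area *area, uint64_t rfbm)
{
	for (unsigned i = 0; i < 64; i++)
		if (rfbm & (UINT64_C(1) << i))
			area->writes++;
	toy_update_bv(cpu, area, rfbm);
}

/* XSAVEOPT (init optimization only): components in their initial state are skipped. */
static void
toy_xsaveopt(const struct toy_cpu *cpu, struct toy_area *area, uint64_t rfbm)
{
	uint64_t todo = rfbm & cpu->xinuse;

	for (unsigned i = 0; i < 64; i++)
		if (todo & (UINT64_C(1) << i))
			area->writes++;
	toy_update_bv(cpu, area, rfbm);
}

/* XRSTOR: XSTATE_BV decides which components are loaded and so become in use. */
static void
toy_xrstor(struct toy_cpu *cpu, const struct toy_area *area, uint64_t rfbm)
{
	cpu->xinuse = (cpu->xinuse & ~rfbm) | (area->xstate_bv & rfbm);
}
```

In this model a CPU with only SSE in use costs one component write under
toy_xsaveopt against three under toy_xsave for the same three-component mask,
and setting a bit in xstate_bv before toy_xrstor turns the matching XINUSE
bit back on.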


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
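
The masking can be sketched in C as follows. The helper names are
hypothetical and not taken from fpu.c; the sketch assumes the mask has
already been read from the MXCSR_MASK field of an FXSAVE image, and it uses
the architectural default of 0x0000FFBF that applies when that field reads
as zero. DAZ is bit 6 of MXCSR, which is the one low bit the default mask
clears.

```c
#include <stdint.h>

/* Architectural default when the FXSAVE image reports MXCSR_MASK == 0. */
#define DEFAULT_MXCSR_MASK	0x0000ffbfu
/* Denormals-are-zero control bit (bit 6 of MXCSR). */
#define MXCSR_DAZ		0x00000040u

/* Hypothetical helper: choose the mask from the value found in an FXSAVE image. */
static uint32_t
effective_mxcsr_mask(uint32_t fx_mxcsr_mask)
{
	return fx_mxcsr_mask != 0 ? fx_mxcsr_mask : DEFAULT_MXCSR_MASK;
}

/* Hypothetical helper: drop the bits that the CPU treats as reserved. */
static uint32_t
sanitize_mxcsr(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

With a fixed mask of 0xFFBF, a user value of 0x1FC0 (the default 0x1F80 plus
DAZ) is reduced to 0x1F80, which is the loss of DAZ mentioned above. With a
detected mask of 0xFFFF the same value passes through unchanged, while
reserved upper bits are still dropped.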


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
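
The reported values can be cross-checked with a little arithmetic. A minimal
sketch, assuming the standard (non-compacted) XSAVE layout and covering only
the x87, SSE and AVX components; the macro and function names are
hypothetical, not fpu.c identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Feature bits as they appear in a value such as machdep.xsave_features. */
#define FEAT_X87	0x1u
#define FEAT_SSE	0x2u
#define FEAT_AVX	0x4u

/* Component sizes in the standard XSAVE layout. */
#define XSAVE_LEGACY_SIZE	512	/* x87 and SSE state (FXSAVE region) */
#define XSAVE_HEADER_SIZE	64	/* XSTATE_BV and the rest of the header */
#define XSAVE_AVX_SIZE		256	/* upper halves of ymm0-ymm15 */

/* Size of the save area for a feature mask limited to x87, SSE and AVX. */
static size_t
xsave_standard_size(uint64_t features)
{
	size_t size = XSAVE_LEGACY_SIZE + XSAVE_HEADER_SIZE;

	if (features & FEAT_AVX)
		size += XSAVE_AVX_SIZE;
	return size;
}
```

A feature mask of 7 decodes to x87 | SSE | AVX, and 512 + 64 + 256 = 832
matches the machdep.xsave_size shown above.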


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
  allow in IPIs.
- There is no point in enabling them when they are blocked in software
  (by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.83 25-Feb-2023 riastradh

x86/fpu.c: Sprinkle KNF.

No functional change intended.


# 1.82 25-Feb-2023 riastradh

x86: Add kthread_fpu_enter/exit support, take two.

This time, make sure to restore the FPU state when switching to a
kthread in the middle of kthread_fpu_enter/exit.

This adds a single predicted-taken branch for the case of kthreads
that are not in kthread_fpu_enter/exit, so it incurs a penalty only
for threads that actually use it. Since it avoids FPU state
switching in kthreads that do use the FPU, namely cgd worker threads,
this should be a net performance win on systems using it and have
negligible impact otherwise.

XXX pullup-10


# 1.81 25-Feb-2023 riastradh

x86: Label boolean is_64bit argument to fpu_area_restore.

No functional change intended.


# 1.80 25-Feb-2023 riastradh

x86: Mitigate MXCSR Configuration Dependent Timing in kernel FPU use.

In fpu_kern_enter, make sure all the MXCSR exception status bits are
set when we start using the FPU, so that instructions which exhibit
MCDT are unaffected by it.

While here, zero all the other FPU registers in fpu_kern_enter.

In principle we could skip this step on future CPUs that fix the MCDT
bug, but there's probably not much benefit -- workloads that do a lot
of crypto in the kernel are probably better off using
kthread_fpu_enter or WQ_FPU to skip the fpu_kern_enter/leave cycles
in the first place.

For details, see:
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/mxcsr-configuration-dependent-timing.html


Revision tags: netbsd-10-base bouyer-sunxi-drm-base
# 1.79 20-Aug-2022 riastradh

fpu_kern_enter/leave: Disable IPL assertions.

These don't work because mutex_enter/exit on a spin lock may raise an
IPL but not lower it, if another spin lock was already held. For
example,

mutex_enter(some_lock_at_IPL_VM);
printf("foo\n");
fpu_kern_enter();
...
fpu_kern_leave();
mutex_exit(some_lock_at_IPL_VM);

will trigger the panic, because printf takes a lock at IPL_HIGH where
the IPL wil remain until the mutex_exit. (This was a nightmare to
track down before I remembered that detail of spin lock IPL
semantics...)


# 1.78 24-May-2022 andvar

fix various typos in comments, docs and log messages.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independant of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the the CPUID OSXSAVE bit is set, as this seems to be
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. no FPU x86 is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
    machdep.fpu_save = 3 (implies xsaveopt)
    machdep.xsave_size = 832
    machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
  allow in IPIs.
- There is no point in enabling them when they are blocked in software
  (by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.79 20-Aug-2022 riastradh

fpu_kern_enter/leave: Disable IPL assertions.

These don't work because mutex_enter/exit on a spin lock may raise an
IPL but not lower it, if another spin lock was already held. For
example,

    mutex_enter(some_lock_at_IPL_VM);
    printf("foo\n");
    fpu_kern_enter();
    ...
    fpu_kern_leave();
    mutex_exit(some_lock_at_IPL_VM);

will trigger the panic, because printf takes a lock at IPL_HIGH where
the IPL will remain until the mutex_exit. (This was a nightmare to
track down before I remembered that detail of spin lock IPL
semantics...)
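
The spin lock behaviour described above can be sketched as a small
user-space model. This is only an illustration under stated assumptions:
the IPL numbers, the spin_enter/spin_exit helpers and the single saved
IPL are hypothetical stand-ins for the kernel's bookkeeping, not the
actual mutex(9) implementation.

```c
#include <assert.h>

/* Illustrative IPL values (assumed ordering: higher blocks more). */
enum { IPL_NONE = 0, IPL_VM = 6, IPL_SCHED = 7, IPL_HIGH = 8 };

static int cur_ipl = IPL_NONE;  /* simulated priority level of this CPU */
static int spin_depth = 0;      /* number of spin locks currently held */
static int saved_ipl;           /* IPL to restore when the last lock drops */

/* Hypothetical model: taking a spin lock may raise the IPL, never lower it. */
static void spin_enter(int lock_ipl)
{
    if (spin_depth++ == 0)
        saved_ipl = cur_ipl;
    if (lock_ipl > cur_ipl)
        cur_ipl = lock_ipl;
}

/* Only releasing the outermost spin lock restores the original IPL. */
static void spin_exit(void)
{
    assert(spin_depth > 0);
    if (--spin_depth == 0)
        cur_ipl = saved_ipl;
}

/* Replays the sequence from the log message and reports the IPL that
 * fpu_kern_enter() would observe in this model. */
int ipl_seen_by_fpu_kern_enter(void)
{
    int seen;

    spin_enter(IPL_VM);     /* mutex_enter(some_lock_at_IPL_VM) */
    spin_enter(IPL_HIGH);   /* printf takes a lock at IPL_HIGH ... */
    spin_exit();            /* ... and drops it; the IPL is not lowered */
    seen = cur_ipl;         /* fpu_kern_enter() would run here */
    spin_exit();            /* mutex_exit(some_lock_at_IPL_VM) */
    return seen;
}
```

In this model the observed level is IPL_HIGH rather than IPL_VM, which is
why an assertion of "at most IPL_VM" in fpu_kern_enter would fire even
though the caller only took an IPL_VM lock.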


# 1.78 24-May-2022 andvar

fix various typos in comments, docs and log messages.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
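
The truncation described above can be illustrated with a simplified
stand-in for the FIP/FDP field. The union below is an assumption made
for the example (its member names are hypothetical and it is not the
kernel's union fp_addr); it only shows that the legacy split layout
keeps 32 bits of the address while the 64-bit layout keeps all of it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one 8-byte pointer slot in the
 * x87 save area. */
union fip_field {
    uint64_t as_64;             /* FXSAVE64 style: full 64-bit address */
    struct {
        uint32_t offset;        /* legacy FXSAVE: low 32 bits only */
        uint16_t segment;       /* legacy FXSAVE: segment selector */
        uint16_t reserved;
    } as_32;
};

/* Address recoverable after storing in the legacy split format. */
uint64_t fip_legacy_roundtrip(uint64_t addr)
{
    union fip_field f;

    memset(&f, 0, sizeof(f));
    f.as_32.offset = (uint32_t)addr;    /* upper 32 bits are lost here */
    return f.as_32.offset;
}

/* Address recoverable after storing in the 64-bit format. */
uint64_t fip_64_roundtrip(uint64_t addr)
{
    union fip_field f;

    memset(&f, 0, sizeof(f));
    f.as_64 = addr;
    return f.as_64;
}
```

For an address such as 0x00007f1234567890, the legacy round trip yields
0x34567890, while the 64-bit round trip returns the address unchanged.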


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

    fpu_kern_enter();
    /* do FPU stuff */
    fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

    splhigh
    unbusy the current cpu's fpu
    create a new fpu state in memory
    install the state on the current cpu's fpu
    splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
    switch lwp to cpu2
    do float operation
    switch lwp to cpu3
    do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
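
A minimal sketch of dynamic detection, assuming the caller already has
a 512-byte FXSAVE image in memory. The helper names are hypothetical and
this is not the fpuinit_mxcsr_mask code; it relies only on the documented
layout (MXCSR_MASK in bytes 28..31 of the image, with 0 meaning "use the
default mask 0x0000FFBF") and on DAZ being bit 6 of MXCSR.

```c
#include <stdint.h>
#include <string.h>

#define FXSAVE_AREA_SIZE    512
#define MXCSR_MASK_OFFSET   28              /* bytes 28..31 of the image */
#define MXCSR_MASK_DEFAULT  0x0000FFBFu     /* used when the CPU reports 0 */
#define MXCSR_DAZ           0x00000040u     /* denormals-are-zeros bit */

/* Hypothetical helper: read the mask the CPU stored in an FXSAVE image. */
uint32_t mxcsr_mask_from_fxsave(const uint8_t area[FXSAVE_AREA_SIZE])
{
    uint32_t mask;

    memcpy(&mask, area + MXCSR_MASK_OFFSET, sizeof(mask));
    return mask != 0 ? mask : MXCSR_MASK_DEFAULT;
}

/* Hypothetical helper: clear bits the CPU would fault on at restore time. */
uint32_t mxcsr_sanitize(uint32_t mxcsr, uint32_t mask)
{
    return mxcsr & mask;
}
```

With the default mask, a user-supplied MXCSR that sets DAZ loses that bit
when sanitized; with a detected mask that includes bit 6, DAZ survives.
That difference is the lost feature the log message refers to.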


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.



Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
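
The size-carrying buffer idea in the last paragraph can be sketched in portable C. This is only an illustration under stated assumptions: `struct sketch_state` and `sketch_copy_to_iovec` are hypothetical names, not the kernel's `struct xstate` or its ptrace code. The point shown is that the copy is bounded by the length the caller put in the `struct iovec`, so a caller built against a smaller structure still receives a valid prefix.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a register structure that may grow over time. */
struct sketch_state {
	uint64_t bv;        /* placeholder: which components are valid */
	uint64_t rfbm;      /* placeholder: which components were requested */
	uint8_t  regs[64];  /* placeholder register payload */
};

/*
 * Copy at most iov->iov_len bytes of *src into the caller's buffer and
 * return the number of bytes actually copied.  A short buffer yields a
 * partial read instead of an overflow.
 */
static size_t
sketch_copy_to_iovec(const struct sketch_state *src, const struct iovec *iov)
{
	size_t n = iov->iov_len < sizeof(*src) ? iov->iov_len : sizeof(*src);

	memcpy(iov->iov_base, src, n);
	return n;
}
```

A caller that passes a 16-byte buffer gets the first 16 bytes; a caller that passes a larger buffer gets the whole structure and the tail of its buffer is left untouched.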


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
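
For context on the "rdi % 16 = 8" observation: FXSAVE requires its memory operand to be 16-byte aligned and faults otherwise. The following is a minimal userland sketch of the alignment arithmetic only; the helper names are hypothetical and nothing here is taken from fpu.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if addr is a multiple of align (align must be a power of two). */
static int
sketch_is_aligned(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t
sketch_align_up(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(uintptr_t)(align - 1);
}
```

An address such as 0x1008 has remainder 8 modulo 16, which is the failing case described in the message; rounding it up gives 0x1010.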


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
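
The "set XSTATE_BV[i]=1 after modifying the in-memory state" rule above can be sketched as follows. This assumes the standard XSAVE layout (a 512-byte legacy region followed by a header whose first 8 bytes are XSTATE_BV) and the usual component bits (x87 = bit 0, SSE = bit 1, AVX = bit 2). The `sketch_` names are hypothetical and are not the helpers used in fpu.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_XSAVE_HDR_OFFSET 512u            /* header follows the legacy region */
#define SKETCH_XCR0_X87 (UINT64_C(1) << 0)
#define SKETCH_XCR0_SSE (UINT64_C(1) << 1)
#define SKETCH_XCR0_AVX (UINT64_C(1) << 2)

static_assert(SKETCH_XSAVE_HDR_OFFSET % 64 == 0, "header offset keeps 64-byte alignment");

/* Read XSTATE_BV from an in-memory XSAVE area (host byte order). */
static uint64_t
sketch_get_xstate_bv(const uint8_t *area)
{
	uint64_t bv;

	memcpy(&bv, area + SKETCH_XSAVE_HDR_OFFSET, sizeof(bv));
	return bv;
}

/*
 * Record that software modified the given components in memory, so that
 * a later XRSTOR loads them instead of using the initial configuration.
 */
static void
sketch_mark_modified(uint8_t *area, uint64_t components)
{
	uint64_t bv = sketch_get_xstate_bv(area) | components;

	memcpy(area + SKETCH_XSAVE_HDR_OFFSET, &bv, sizeof(bv));
}
```

Marking is cumulative: flagging AVX and then x87 leaves both bits set, while components that were never touched stay clear.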

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
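
To illustrate why a fixed mask loses DAZ: per the Intel SDM, when FXSAVE reports an MXCSR_MASK of zero the default mask is 0x0000FFBF, which excludes bit 6 (DAZ). The sketch below shows the masking arithmetic only; the function names are hypothetical and the detection step (reading the mask from an FXSAVE image) is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MXCSR_DEFAULT_MASK 0x0000ffbfu  /* used when the CPU reports a mask of 0 */
#define SKETCH_MXCSR_DAZ          0x00000040u  /* denormals-are-zero, bit 6 */

static_assert((SKETCH_MXCSR_DEFAULT_MASK & SKETCH_MXCSR_DAZ) == 0,
    "the default mask does not cover DAZ");

/* Pick the mask to apply: the reported one, or the default if none is reported. */
static uint32_t
sketch_effective_mask(uint32_t reported_mask)
{
	return reported_mask != 0 ? reported_mask : SKETCH_MXCSR_DEFAULT_MASK;
}

/* Clear any bits the CPU treats as reserved so that a later restore cannot fault. */
static uint32_t
sketch_sanitize_mxcsr(uint32_t user_mxcsr, uint32_t reported_mask)
{
	return user_mxcsr & sketch_effective_mask(reported_mask);
}
```

With the default mask a user-requested DAZ bit is silently dropped; with a detected mask of 0xFFFF it is preserved.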


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
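
As a reading aid for the value 7 above: assuming machdep.xsave_features is an XCR0-style bitmask (an assumption, not stated in the message), bits 0, 1 and 2 correspond to x87, SSE and AVX. A minimal decoder, with hypothetical names, could look like this:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Names for the three lowest XCR0 component bits. */
static const char *const sketch_xcr0_names[] = { "x87", "SSE", "AVX" };

/*
 * Write a space-separated list of the known components set in mask into
 * buf and return how many were written.  Unknown higher bits are ignored.
 */
static int
sketch_describe_features(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (size_t i = 0; i < 3; i++) {
		if ((mask & (UINT64_C(1) << i)) == 0)
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
		    used != 0 ? " " : "", sketch_xcr0_names[i]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated; stop here */
		used += (size_t)n;
		count++;
	}
	return count;
}
```

Under that assumption, a mask of 7 decodes to all three components, which matches a CPU with AVX support.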


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.77 01-Apr-2022 riastradh

x86, arm: Allow fpu_kern_enter/leave while cold.

Normally these are forbidden above IPL_VM, so that FPU usage doesn't
block IPL_SCHED or IPL_HIGH interrupts. But while cold, e.g. during
builtin module initialization at boot, all interrupts are blocked
anyway so it's a moot point.

Also initialize x86 cpu_info_primary.ci_kfpu_spl to -1 so we don't
trip over an assertion about it while cold -- the assertion is meant
to detect reentrance into fpu_kern_enter/leave, which is prohibited.

Also initialize cpu0's ci_kfpu_spl.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).
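
The truncation described above can be illustrated with a small sketch. The union below only mirrors the two layouts named in the message (a 64-bit field versus a 32-bit offset, 16-bit segment and 16-bit reserved field). Its member names other than fa_64 and fa_32 are hypothetical, and it is not the kernel's definition of union fp_addr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two FIP/FDP layouts. */
union sketch_fp_addr {
	uint64_t fa_64;              /* FXSAVE64: full 64-bit pointer */
	struct {
		uint32_t fa_off;     /* legacy FXSAVE: 32-bit offset */
		uint16_t fa_seg;     /* 16-bit segment selector */
		uint16_t fa_rsvd;    /* 16-bit reserved field */
	} fa_32;
};

static_assert(sizeof(union sketch_fp_addr) == 8, "both layouts occupy 8 bytes");

/* The offset a legacy FXSAVE image can hold for a given instruction pointer. */
static uint32_t
sketch_legacy_offset(uint64_t fip)
{
	return (uint32_t)fip;
}

/* Nonzero if the pointer survives a round trip through the legacy layout. */
static int
sketch_fits_legacy(uint64_t fip)
{
	return (uint64_t)sketch_legacy_offset(fip) == fip;
}
```

A typical 64-bit userland address such as 0x00007f7fdeadbeef does not fit, which is why the data from the old variant was not meaningful for 64-bit programs.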


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.
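
The XSTATE_BV rule in the list above can be sketched as follows. This is a minimal illustration, assuming the standard non-compacted XSAVE layout (a 512-byte legacy region followed by a 64-byte header whose first field is XSTATE_BV) and the usual component bits (bit 0 x87, bit 1 SSE, bit 2 AVX). The struct and function names are hypothetical and are not the ones used in fpu.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Component bits as used in XSTATE_BV (assumed: x87, SSE, AVX only). */
#define XSTATE_X87 (1ull << 0)
#define XSTATE_SSE (1ull << 1)
#define XSTATE_AVX (1ull << 2)

/* Hypothetical mirror of the start of a non-compacted XSAVE area. */
struct xsave_area_sketch {
	uint8_t  legacy[512];   /* FXSAVE-compatible region */
	uint64_t xstate_bv;     /* components that are valid in memory */
	uint64_t xcomp_bv;
	uint64_t reserved[6];
};

_Static_assert(offsetof(struct xsave_area_sketch, xstate_bv) == 512,
    "the header starts right after the legacy region");
_Static_assert(sizeof(struct xsave_area_sketch) == 576,
    "legacy region plus 64-byte header");

/*
 * Hypothetical helper: after software edits a component in memory, set its
 * XSTATE_BV bit so that the next XRSTOR loads it instead of using the
 * initial configuration.
 */
static void
xsave_mark_modified(struct xsave_area_sketch *area, uint64_t component)
{
	area->xstate_bv |= component;
}
```

A caller that rewrites only the x87 control word in memory would then call xsave_mark_modified(area, XSTATE_X87) before the state is restored.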

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
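
A minimal sketch of why a fixed mask loses features, assuming the documented FXSAVE convention that a reported MXCSR_MASK of zero stands for the default 0xFFBF, and that DAZ is MXCSR bit 6. The helper names are hypothetical and are not the functions in fpu.c.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ          0x0040u /* denormals-are-zero, MXCSR bit 6 */
#define MXCSR_DEFAULT_MASK 0xFFBFu /* default when the CPU reports a mask of 0 */

/* The default mask excludes DAZ, which is why a static mask drops it. */
_Static_assert((MXCSR_DEFAULT_MASK & MXCSR_DAZ) == 0,
    "default mask does not cover DAZ");

/* Hypothetical helper: choose the mask from the value found in an FXSAVE image. */
static uint32_t
mxcsr_effective_mask(uint32_t reported_mask)
{
	return reported_mask != 0 ? reported_mask : MXCSR_DEFAULT_MASK;
}

/* Hypothetical helper: clear bits the CPU rejects, so a restore cannot fault. */
static uint32_t
mxcsr_sanitize(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

With the default mask, a user value that sets DAZ comes back with DAZ cleared; with a mask detected from a CPU that supports DAZ, the bit survives.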


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
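
The sysctl values quoted above can be read as follows. This is a sketch, assuming xsave_features carries XCR0-style component bits (bit 0 x87, bit 1 SSE, bit 2 AVX) and the standard non-compacted XSAVE layout: a 512-byte legacy region, a 64-byte header, and 256 bytes for the upper halves of the ymm registers. The macro and function names are illustrative only, and the function handles just these three components.

```c
#include <assert.h>
#include <stdint.h>

#define XF_X87 (1u << 0)
#define XF_SSE (1u << 1)
#define XF_AVX (1u << 2)

#define XSAVE_LEGACY_SIZE 512u /* x87 and SSE state live here */
#define XSAVE_HEADER_SIZE 64u
#define XSAVE_AVX_SIZE    256u /* upper 128 bits of 16 ymm registers */

/* Hypothetical helper: size of the save area for a feature mask (x87/SSE/AVX only). */
static uint32_t
xsave_area_size(uint64_t features)
{
	uint32_t size = XSAVE_LEGACY_SIZE + XSAVE_HEADER_SIZE;

	if (features & XF_AVX)
		size += XSAVE_AVX_SIZE;
	return size;
}
```

Under these assumptions a feature mask of 7 means x87, SSE and AVX, and the size works out to 512 + 64 + 256 = 832 bytes, matching the values in the log.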


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.76 24-Oct-2020 mgorny

Issue 64-bit versions of *XSAVE* for 64-bit amd64 programs

When calling FXSAVE, XSAVE, FXRSTOR, ... for 64-bit programs on amd64
use the 64-suffixed variant in order to include the complete FIP/FDP
registers in the x87 area.

The difference between the two variants is that the FXSAVE64 (new)
variant represents FIP/FDP as 64-bit fields (union fp_addr.fa_64),
while the legacy FXSAVE variant uses split fields: 32-bit offset,
16-bit segment and 16-bit reserved field (union fp_addr.fa_32).
The latter implies that the actual addresses are truncated to 32 bits
which is insufficient in modern programs.
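
The truncation described above can be sketched in a few lines. The struct below is a hypothetical stand-in for the split view (32-bit offset, 16-bit segment, 16-bit reserved); it is not the definition of union fp_addr from the NetBSD headers, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the legacy (split) view of FIP/FDP. */
struct fp_addr32_sketch {
	uint32_t offset;   /* low 32 bits of the address */
	uint16_t segment;
	uint16_t reserved;
};

_Static_assert(sizeof(struct fp_addr32_sketch) == sizeof(uint64_t),
    "the split view and the 64-bit view occupy the same 8 bytes");

/* What the legacy view can retain of a 64-bit instruction or data pointer. */
static uint32_t
fp_addr_legacy_offset(uint64_t full_address)
{
	return (uint32_t)full_address;
}

/* Nonzero if the legacy view would lose information for this address. */
static int
fp_addr_is_truncated(uint64_t full_address)
{
	return (uint64_t)fp_addr_legacy_offset(full_address) != full_address;
}
```

For a typical 64-bit userland address such as 0x00007f1234567890, only 0x34567890 survives in the split view, which is why the 64-suffixed instructions are needed to report FIP/FDP in full.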

The change is applied only to 64-bit programs on amd64. Plain i386
and compat32 continue using plain FXSAVE. Similarly, NVMM is not
changed as I am not familiar with that code.

This is a potentially breaking change. However, I don't think it likely
to actually break anything because the data provided by the old variant
were not meaningful (because of the truncated pointer).


# 1.75 15-Oct-2020 mgorny

Revert "Merge convert_xmm_s87.c into fpu.c"

I am going to add ATF tests for these two functions, and having them
in a separate file will make it more convenient to build and run them
in userspace.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

branches: 1.55.2;
More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
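
The partial read behaviour described above can be modelled in userland as follows. This is only a sketch of the size handling, assuming the copy is bounded by the smaller of the caller's buffer and the current structure size; it is not the kernel's ptrace code, and copy_state_to_caller is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: copy at most iov_len bytes of the state into the
 * caller's buffer and report how many bytes were written. An older caller
 * with a shorter buffer gets a prefix of a newer, larger structure.
 */
static size_t
copy_state_to_caller(const struct iovec *iov, const void *state,
    size_t state_size)
{
	size_t n = iov->iov_len < state_size ? iov->iov_len : state_size;

	memcpy(iov->iov_base, state, n);
	return n;
}
```

Because the caller states its buffer size explicitly, growing the structure later does not overrun buffers allocated by existing programs.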


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is that, for some reason,
a wrong stack alignment causes FXSAVE to fault in fpuinit_mxcsr_mask. As
seen in current-users@ yesterday, rdi % 16 = 8. And as seen several months
ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.74 02-Aug-2020 riastradh

Revert "Add kthread_fpu_enter/exit support to x86." for now.

Need to find all the paths out of interrupts back into _kernel_
context to add HANDLE_DEFERRED_FPU, I think, before this can be
enabled.


# 1.73 01-Aug-2020 riastradh

Add kthread_fpu_enter/exit support to x86.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
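
The size-by-iovec convention described above can be illustrated in
isolation. This is a minimal sketch, not the kernel's implementation:
struct demo_xstate and demo_getxstate() are made-up names, and the real
'struct xstate' has a different layout. It only shows why a caller built
against a smaller structure keeps working: the copy is bounded by the
length the caller declared.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical stand-in for the kernel's register structure. */
struct demo_xstate {
	unsigned long long rfbm; /* which components are present */
	unsigned char regs[64];  /* placeholder register payload */
};

/*
 * Copy at most iov_len bytes of *src into the caller's buffer and
 * return the number of bytes copied. A short buffer yields a partial
 * read instead of an overflow or an error.
 */
size_t
demo_getxstate(const struct demo_xstate *src, const struct iovec *iov)
{
	size_t n = iov->iov_len < sizeof(*src) ? iov->iov_len : sizeof(*src);

	memcpy(iov->iov_base, src, n);
	return n;
}
```

A caller that passes an 8-byte buffer gets back only the first 8 bytes,
while a caller with a larger buffer gets the whole structure.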


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
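
The alignment problem mentioned above ("rdi % 16 = 8") is easy to
reproduce outside the kernel. The following is a sketch under the
assumption of a C11 compiler; demo_save_area and demo_is_aligned() are
hypothetical names. FXSAVE needs a 16-byte aligned operand and XSAVE a
64-byte aligned one, so a buffer that is requested with 64-byte
alignment satisfies both.

```c
#include <stdint.h>

/* Hypothetical save area, explicitly aligned rather than left to luck. */
_Alignas(64) unsigned char demo_save_area[512];

/* Return nonzero if p is a multiple of align (align must be nonzero). */
int
demo_is_aligned(const void *p, uintptr_t align)
{
	return ((uintptr_t)p % align) == 0;
}
```

An address that is 8 bytes past a 16-byte boundary, as in the report
above, fails the 16-byte check, which is the condition under which
FXSAVE faults.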


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.
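
The rule "software sets XSTATE_BV[i]=1 when it changes the in-memory
state" can be sketched as follows. This is an illustration only:
struct demo_xsave and demo_set_x87_cw() are hypothetical and much
simpler than the kernel's save area, although the offsets used (x87
control word at byte 0 of the legacy region, XSTATE_BV at the start of
the header at byte 512) follow the documented XSAVE layout.

```c
#include <stdint.h>
#include <string.h>

#define XF_X87 (UINT64_C(1) << 0) /* XSTATE_BV bit for the x87 component */

/* Hypothetical, simplified view of the start of an XSAVE area. */
struct demo_xsave {
	uint8_t  legacy[512];  /* FXSAVE-compatible region */
	uint64_t xstate_bv;    /* first field of the XSAVE header */
	uint8_t  hdr_rest[56]; /* remainder of the 64-byte header */
};

/*
 * Change the x87 control word in the in-memory state. Setting the x87
 * bit in XSTATE_BV tells the next XRSTOR to load this component from
 * memory instead of putting it in its initial configuration.
 */
void
demo_set_x87_cw(struct demo_xsave *xs, uint16_t cw)
{
	memcpy(&xs->legacy[0], &cw, sizeof(cw));
	xs->xstate_bv |= XF_X87;
}
```

Without the XSTATE_BV update, a control word written into a freshly
zeroed area would be ignored on restore, which is the kind of problem
that revision 1.18 describes.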


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
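
Why a fixed mask loses DAZ can be shown with plain integer arithmetic.
The sketch below is not the code in fpu.c; demo_mxcsr_mask() and
demo_sanitize_mxcsr() are hypothetical helpers. It relies on the
documented rule that an MXCSR_MASK field of zero in the FXSAVE image
means the default mask 0x0000FFBF, which excludes the DAZ bit (bit 6).

```c
#include <stdint.h>

#define MXCSR_DEFAULT_MASK UINT32_C(0x0000FFBF) /* used when FXSAVE reports 0 */
#define MXCSR_DAZ          UINT32_C(0x00000040) /* denormals-are-zero */

/* Pick the effective mask from the MXCSR_MASK field of an FXSAVE image. */
uint32_t
demo_mxcsr_mask(uint32_t reported)
{
	return reported != 0 ? reported : MXCSR_DEFAULT_MASK;
}

/* Drop bits the CPU treats as reserved so that a later restore cannot fault. */
uint32_t
demo_sanitize_mxcsr(uint32_t user_value, uint32_t mask)
{
	return user_value & mask;
}
```

For a user-supplied value of 0x00011FC0, a CPU reporting a mask of
0x0000FFFF keeps DAZ and yields 0x1FC0, while the default mask yields
0x1F80 and silently clears DAZ. In both cases the reserved bit 16 is
removed.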


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. no FPU x86 is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus; I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
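
A rough sketch of why a hard-coded mask loses DAZ. It assumes the architectural convention that an MXCSR_MASK field of zero in the FXSAVE image means the default mask 0xFFBF, which excludes DAZ (bit 6). The function names are hypothetical and are not those used in fpu.c.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ		0x0040u	/* denormals-are-zero, bit 6 */
#define MXCSR_DEFAULT_MASK	0xFFBFu	/* used when the CPU reports 0 */

/* Hypothetical helper: pick the mask reported by FXSAVE, or the default. */
static uint32_t
mxcsr_effective_mask(uint32_t reported)
{
	return reported != 0 ? reported : MXCSR_DEFAULT_MASK;
}

/* Hypothetical helper: drop bits the CPU does not accept in MXCSR. */
static uint32_t
mxcsr_sanitize(uint32_t mxcsr, uint32_t mask)
{
	return mxcsr & mask;
}
```

With the default mask, a user-supplied value that has DAZ set comes back with DAZ cleared. With a detected mask that includes bit 6, it survives.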


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.72 20-Jul-2020 riastradh

Fix fpu_kern_enter in a softint that interrupted a softint.

We need to find the lwp that was originally interrupted to save its
fpu state.

With this, fpu-heavy programs (like firefox) are once again stable,
at least under modest stress testing, on systems configured to use
wifi with WPA2 and CCMP.


# 1.71 20-Jul-2020 riastradh

Save fpu state at IPL_VM to exclude fpu_kern_enter/leave.

This way fpu_kern_enter/leave cannot interrupt the transition, so the
transition from state-on-CPU to state-in-memory (with TS set) is
atomic whether in an interrupt or not.

(I am not 100% convinced that this is necessary, but it makes
reasoning about the transition simpler.)


# 1.70 20-Jul-2020 riastradh

Revert 1.66 "Fix race in fpu save with fpu_kern_enter in softint."

This only fixed part of the race, and we can do it more simply.


# 1.69 20-Jul-2020 riastradh

Revert 1.67 "Restore the lwp's fpu state, not zeros, and leave with fpu enabled."

This didn't actually avoid double-restore, and it doesn't solve the
problem anyway, and made it harder to detect in-kernel fpu abuse.


# 1.68 13-Jul-2020 riastradh

Limit x86 fpu_kern_enter/leave to IPL_VM or below.

There are no users of crypto at IPL_SCHED or IPL_HIGH as far as I
know, and although we generally limit the amount of time spent in any
one crypto operation -- e.g., cgd is usually limited to processing
512 or 4096 bytes at a time -- it's better not to block IPL_SCHED and
IPL_HIGH interrupts at all. This should make ddb a little more
accessible during crypto-heavy workloads.

This means the aes_* API cannot be used at IPL_SCHED or IPL_HIGH; the
same will go for any new crypto subsystems, like the ChaCha and
Poly1305 ones I'm drafting. It might be better to prohibit them
altogether in hard interrupt context, but right now cprng_fast and
cprng_strong are both technically allowed at IPL_VM and are sometimes
used there (e.g., for opencrypto CBC IV generation).

KASSERT the ilevel to detect violation of this constraint in case I'm
wrong.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
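
The effect of passing a 'struct iovec' can be sketched in userland C: the copy is bounded by the size the caller declared, so a caller built against a smaller structure still gets a valid partial result. The structure layout and the function name below are hypothetical and do not reflect the real 'struct xstate'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical stand-in for an extensible register structure. */
struct xstate_sketch {
	unsigned long long rfbm;	/* which components are present */
	unsigned char regs[64];		/* placeholder register payload */
};

/*
 * Hypothetical helper: copy at most iov_len bytes of the structure
 * into the caller's buffer and return the number of bytes copied.
 */
static size_t
copy_xstate_out(const struct xstate_sketch *src, const struct iovec *iov)
{
	size_t n = iov->iov_len < sizeof(*src) ? iov->iov_len : sizeof(*src);

	memcpy(iov->iov_base, src, n);
	return n;
}
```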


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
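
A small sketch of the idea, assuming the usual save-area sizes (108 bytes for FNSAVE, 512 bytes for FXSAVE). The union and the function name are hypothetical; the point is only that the copy length follows the save size in use rather than the size of the largest member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FNSAVE_SIZE	108	/* legacy x87 save area */
#define FXSAVE_SIZE	512	/* FXSAVE save area */

/* Hypothetical stand-in for the FPU area embedded in the PCB. */
union savefpu_sketch {
	uint8_t fnsave[FNSAVE_SIZE];
	uint8_t fxsave[FXSAVE_SIZE];
};

/* Hypothetical helper: copy only the bytes the active format uses. */
static void
fpu_state_copy(union savefpu_sketch *dst, const union savefpu_sketch *src,
    size_t save_size)
{
	assert(save_size <= sizeof(*dst));
	memcpy(dst, src, save_size);
}
```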


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.




Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
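
The size negotiation through 'struct iovec' described above can be sketched
in userland C. This is only a model of the idea, not the kernel code: the two
struct layouts, the field 'xs_valid' and the helper xstate_copyout() are
hypothetical names introduced for illustration. Only the use of an iovec to
carry an explicit buffer length is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical layout that an existing debugger was compiled against. */
struct xstate_v1 {
	uint64_t xs_rfbm;	/* components requested */
	uint64_t xs_valid;	/* hypothetical: components present */
};

/* Hypothetical later layout: same prefix, one component appended. */
struct xstate_v2 {
	uint64_t xs_rfbm;
	uint64_t xs_valid;
	uint8_t  xs_extra[32];	/* hypothetical new register block */
};

/*
 * Copy at most iov->iov_len bytes of the current state into the
 * caller's buffer and return the number of bytes copied.  A caller
 * built against the shorter layout gets a partial read of the prefix
 * instead of a buffer overrun.
 */
static size_t
xstate_copyout(const struct xstate_v2 *src, const struct iovec *iov)
{
	size_t n;

	assert(src != NULL && iov != NULL && iov->iov_base != NULL);
	n = iov->iov_len < sizeof(*src) ? iov->iov_len : sizeof(*src);
	memcpy(iov->iov_base, src, n);
	return n;
}
```

Because the length travels with the pointer, the model never needs a version
number: appending fields to the newer struct leaves old callers unaffected.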


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
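
A minimal userland sketch of the variable-sized copy, assuming the FPU save
area is the last member of the structure. 'struct pcb_model', FPU_AREA_MAX and
pcb_copy() are hypothetical stand-ins, not the kernel's definitions, and the
sketch only covers the case where the real state is no larger than the
embedded area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FPU_AREA_MAX 512	/* hypothetical size of the embedded save area */

struct pcb_model {
	long          pcb_flags;			/* stand-in for non-FPU fields */
	unsigned char pcb_savefpu[FPU_AREA_MAX];	/* FPU save area, last member */
};

/*
 * Copy the fixed part of the structure plus only the fpu_size bytes of
 * FPU state that are actually in use.  Returns the number of bytes
 * copied; anything past that point in dst is left untouched.
 */
static size_t
pcb_copy(struct pcb_model *dst, const struct pcb_model *src, size_t fpu_size)
{
	size_t n;

	assert(dst != NULL && src != NULL);
	assert(fpu_size <= FPU_AREA_MAX);
	n = offsetof(struct pcb_model, pcb_savefpu) + fpu_size;
	memcpy(dst, src, n);
	return n;
}
```

With a 108-byte FNSAVE image, for example, pcb_copy(&dst, &src, 108) leaves
the remaining bytes of the save area uncopied.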


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
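
A small userland sketch of why the mask has to be detected rather than
hard-coded. The helper names are hypothetical; the constants are the
architectural ones (DAZ is bit 6 of MXCSR, and 0x0000FFBF is the default mask
to use when the MXCSR_MASK field stored by FXSAVE reads as zero).

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ          UINT32_C(0x00000040)	/* denormals-are-zero */
#define MXCSR_DEFAULT_MASK UINT32_C(0x0000FFBF)	/* used when reported mask is 0 */

/* Pick the mask to apply, given the value reported in the FXSAVE image. */
static uint32_t
mxcsr_effective_mask(uint32_t reported_mask)
{
	return reported_mask != 0 ? reported_mask : MXCSR_DEFAULT_MASK;
}

/*
 * Clear every bit the CPU does not accept, so that a value supplied by
 * userland cannot make a later restore fault on reserved bits.
 */
static uint32_t
mxcsr_sanitize(uint32_t user_mxcsr, uint32_t mask)
{
	assert(mask != 0);	/* callers should pass mxcsr_effective_mask() */
	return user_mxcsr & mask;
}
```

With the default mask the DAZ bit is stripped from a user-supplied value; with
a detected mask of 0xFFFF it survives, which is the feature loss the commit
message refers to.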


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.67 06-Jul-2020 riastradh

Restore the lwp's fpu state, not zeros, and leave with fpu enabled.

We need to clear the fpu state anyway because it is likely to contain
secrets at this point. Previously we set it to zeros, and then issued
stts to disable the fpu in order to detect the mistake of further use
of the fpu in kernel. But there must be some path I haven't identified
yet that doesn't do fpu_handle_deferred, leading to fpudna panics.

In any case, there's no benefit to restoring the fpu state twice
(once with zeros and once with the real data). The downside is,
although this avoids spurious fpudna traps, using fpu_kern_enter in a
softint has the side effect that -- until the next userland context
switch triggering stts -- we no longer detect misuse of fpu in the
kernel in that lwp. This will serve for now, but we should find
another way to issue clts/stts judiciously to detect such misuse.

May improve the continued symptoms of
https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html
although may not fix everything.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independant of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the the CPUID OSXSAVE bit is set, as this seems to be
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
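
A minimal sketch of why a fixed mask loses DAZ, assuming the usual x86 convention that a zero MXCSR_MASK field in the FXSAVE image means the default mask 0xFFBF applies. The function names are hypothetical and are not taken from fpu.c.

```c
#include <stdint.h>

#define MXCSR_DAZ           0x0040u /* denormals-are-zero control bit */
#define DEFAULT_MXCSR_MASK  0xFFBFu /* fallback when FXSAVE reports a zero mask */

/* Pick the mask to apply, given the MXCSR_MASK field read back from FXSAVE. */
static uint32_t
effective_mxcsr_mask(uint32_t fxsave_mask_field)
{
	return fxsave_mask_field != 0 ? fxsave_mask_field : DEFAULT_MXCSR_MASK;
}

/* Clear every bit outside the mask before the value is handed back to the CPU. */
static uint32_t
sanitize_mxcsr(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

With the fallback mask, a request for 0x1F80 | MXCSR_DAZ comes back as 0x1F80, so DAZ is silently dropped; with a detected mask that includes bit 6, the bit survives.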


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.
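
The idea can be sketched as follows, assuming a simplified save area (a 512-byte legacy region followed by a 64-byte header). The struct and function names are hypothetical; only the bit positions (x87 = bit 0, SSE = bit 1) are architectural.

```c
#include <stdint.h>
#include <string.h>

#define XSTATE_X87  (1ULL << 0)
#define XSTATE_SSE  (1ULL << 1)

/* Hypothetical, simplified layout: legacy region, then the XSAVE header. */
struct toy_xsave_area {
	uint8_t  legacy[512];
	uint64_t xstate_bv;
	uint64_t reserved[7];
};

/*
 * Fill in the legacy (x87 + SSE) region and mark both components as
 * present. Without the xstate_bv update, a later restore would treat the
 * components as being in their initial state and ignore the new data.
 */
static void
toy_set_legacy_state(struct toy_xsave_area *area, const uint8_t legacy[512])
{
	memcpy(area->legacy, legacy, sizeof(area->legacy));
	area->xstate_bv |= XSTATE_X87 | XSTATE_SSE;
}
```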


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
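
As a rough illustration of how the two xsave numbers above relate, here is a sketch that handles only the x87, SSE and AVX components in the standard (non-compacted) layout. The helper name is hypothetical; a real kernel would take the size from CPUID rather than compute it this way.

```c
#include <stddef.h>
#include <stdint.h>

#define XCR0_X87  (1ULL << 0)
#define XCR0_SSE  (1ULL << 1)
#define XCR0_AVX  (1ULL << 2) /* upper halves of ymm0..ymm15 */

#define XSAVE_LEGACY_BYTES  512u /* FXSAVE-compatible region (x87 + SSE) */
#define XSAVE_HEADER_BYTES   64u /* holds XSTATE_BV */
#define XSAVE_AVX_BYTES     256u /* 16 registers x 16 bytes */

/* Size of the save area for a feature mask limited to x87/SSE/AVX. */
static size_t
toy_xsave_size(uint64_t features)
{
	size_t size = XSAVE_LEGACY_BYTES + XSAVE_HEADER_BYTES;

	if (features & XCR0_AVX)
		size += XSAVE_AVX_BYTES;
	return size;
}
```

A feature mask of 7 is x87 | SSE | AVX, and 512 + 64 + 256 gives the 832 bytes reported above.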


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.66 06-Jul-2020 riastradh

Fix race in fpu save with fpu_kern_enter in softint.

Likely source of:

https://mail-index.netbsd.org/current-users/2020/07/02/msg039051.html


# 1.65 14-Jun-2020 riastradh

Use static constant rather than stack memset buffer for zero fpregs.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();
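
A usage sketch of the bracketing pattern, with userland stand-ins for the two kernel functions so that it compiles outside the kernel. The stubs, the depth counter and toy_xor_buffers are hypothetical, and the stubs assume that sections are not nested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel API; they only track that calls are paired. */
static int toy_fpu_kern_depth;

static void
fpu_kern_enter(void)
{
	assert(toy_fpu_kern_depth == 0);
	toy_fpu_kern_depth++;
}

static void
fpu_kern_leave(void)
{
	assert(toy_fpu_kern_depth == 1);
	toy_fpu_kern_depth--;
}

/* Every FPU/SIMD-using region sits strictly between enter and leave. */
static void
toy_xor_buffers(uint8_t *dst, const uint8_t *src, size_t len)
{
	fpu_kern_enter();
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i]; /* placeholder for real vectorized work */
	fpu_kern_leave();
}
```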


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both requests take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
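
A sketch of why an explicit buffer size keeps old callers working: the copy is bounded by the smaller of the caller's buffer and the current structure size. The helper is hypothetical and only models the idea; it is not the kernel's implementation.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy at most iov->iov_len bytes of the state structure into the caller's
 * buffer and return the number of bytes copied. A caller built against an
 * older, smaller structure simply receives the leading part.
 */
static size_t
toy_copy_to_iovec(const struct iovec *iov, const void *state, size_t state_size)
{
	size_t n = iov->iov_len < state_size ? iov->iov_len : state_size;

	memcpy(iov->iov_base, state, n);
	return n;
}
```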


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
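
A minimal sketch of the smaller-than-static case, assuming a hypothetical PCB with a fixed-size FPU area and a runtime variable holding the number of bytes the active save instruction really uses. None of these names come from fpu.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FPU_AREA_MAX 512u /* hypothetical size of the static area */

struct toy_pcb {
	unsigned char fpu_area[TOY_FPU_AREA_MAX];
};

/* Hypothetical runtime value; 108 bytes is the size of an FNSAVE image. */
static size_t toy_fpu_save_size = 108;

/* Copy only the bytes that are in use instead of the whole static area. */
static void
toy_fork_copy_fpu(struct toy_pcb *child, const struct toy_pcb *parent)
{
	assert(toy_fpu_save_size <= sizeof(parent->fpu_area));
	memcpy(child->fpu_area, parent->fpu_area, toy_fpu_save_size);
}
```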


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
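
The XSTATE_BV rule from the list above can be modelled in userland. This is only a toy sketch, not the kernel code: the struct layout and the toy_* names are assumptions made for illustration. It shows that a control word written to the in-memory area only takes effect on restore if the x87 bit is set in XSTATE_BV; otherwise the restore falls back to the initial value (0x037f).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XCR0_X87	0x1ULL		/* x87 component bit */
#define X87_INIT_CW	0x037f		/* control word of the x87 initial state */

/* Toy stand-in for the in-memory save area; NOT the real XSAVE layout. */
struct toy_xsave_area {
	uint16_t fx_cw;		/* x87 control word image */
	uint64_t xstate_bv;	/* components that are valid in memory */
};

/* Software edits the in-memory image and flags the component as valid. */
void
toy_set_cw(struct toy_xsave_area *a, uint16_t cw)
{
	assert(a != NULL);
	a->fx_cw = cw;
	a->xstate_bv |= XCR0_X87;
}

/*
 * Model of the x87 part of a restore: load from memory when the bit is
 * set, otherwise fall back to the initial configuration.
 */
uint16_t
toy_xrstor_cw(const struct toy_xsave_area *a)
{
	assert(a != NULL);
	return (a->xstate_bv & XCR0_X87) ? a->fx_cw : X87_INIT_CW;
}
```

With xstate_bv left at 0, toy_xrstor_cw() ignores fx_cw entirely, which is the failure mode that 1.18 and 1.23 describe.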

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
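
A minimal sketch of the masking from 1.19 combined with a detected mask, assuming the mask comes from the MXCSR_MASK field that FXSAVE stores. The function names are hypothetical. The fallback value 0x0000ffbf is the architectural default used when that field reads as 0; it leaves out DAZ (bit 6), which is why a hard-coded mask loses that feature on CPUs that support it.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_MXCSR_MASK	0x0000ffbfu	/* used when FXSAVE reports 0 */
#define MXCSR_DAZ		0x00000040u	/* denormals-are-zero, bit 6 */

/* Hypothetical helper: choose the mask from the value FXSAVE stored. */
uint32_t
toy_effective_mxcsr_mask(uint32_t fxsave_mask)
{
	return fxsave_mask != 0 ? fxsave_mask : DEFAULT_MXCSR_MASK;
}

/* Clear every bit userland is not allowed to set before restoring. */
uint32_t
toy_sanitize_mxcsr(uint32_t user_mxcsr, uint32_t fxsave_mask)
{
	uint32_t mask = toy_effective_mxcsr_mask(fxsave_mask);
	uint32_t out = user_mxcsr & mask;

	assert((out & ~mask) == 0);
	return out;
}
```

With the default mask a request for DAZ is silently dropped, while a detected mask that includes bit 6 keeps it.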


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. FPU-less x86 is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.64 13-Jun-2020 riastradh

Add comments over fpu_kern_enter/leave.


# 1.63 13-Jun-2020 riastradh

Zero the fpu registers on fpu_kern_leave.

Avoid Spectre-class attacks on any values left in them.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
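
The size-negotiation idea behind the 'struct iovec' argument can be sketched in plain C. This is a userland model, not the ptrace implementation: toy_copy_state_out is a hypothetical name, and the byte arrays in the usage stand in for 'struct xstate' at two different sizes. The copy is bounded by the smaller of the caller's buffer and the current state size, so a caller built against an older, shorter structure gets a partial read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: copy at most iov_len bytes of the state into the
 * caller's buffer and report how many bytes were actually copied.
 */
size_t
toy_copy_state_out(const void *kstate, size_t ksize, const struct iovec *iov)
{
	size_t n;

	assert(kstate != NULL && iov != NULL);
	n = iov->iov_len < ksize ? iov->iov_len : ksize;
	if (n != 0)
		memcpy(iov->iov_base, kstate, n);
	return n;
}
```

A caller that passes a 4-byte buffer against an 8-byte state receives the first 4 bytes and a return value of 4.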


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
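
A rough sketch of the variable-sized copy, covering only the case where the live state is smaller than the static area. The union and function names are hypothetical; the sizes are the usual 108 bytes for an FNSAVE image and 512 bytes for an FXSAVE image.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FNSAVE_SIZE	108	/* legacy x87 image */
#define TOY_FXSAVE_SIZE	512	/* FXSAVE image */

/* Toy stand-in for the static FPU area embedded in the PCB. */
union toy_savefpu {
	unsigned char sv_87[TOY_FNSAVE_SIZE];
	unsigned char sv_xmm[TOY_FXSAVE_SIZE];
};

/* Copy only the bytes the active save format really uses. */
void
toy_fpu_copy(union toy_savefpu *dst, const union toy_savefpu *src,
    size_t save_size)
{
	assert(dst != NULL && src != NULL);
	assert(save_size <= sizeof(*dst));
	memcpy(dst, src, save_size);
}
```

With save_size set to 108, the bytes past the FNSAVE image in the destination are left untouched.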


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
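
The alignment condition behind that fault can be checked with a small helper. This is an illustrative sketch with a hypothetical name, not code from fpu.c. FXSAVE needs a 16-byte aligned memory operand, so an address with rdi % 16 = 8 fails the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return nonzero if p is aligned to 'align', which must be a power of two. */
int
toy_is_aligned(const void *p, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t)p & (align - 1)) == 0;
}
```

In C11 a buffer declared as `_Alignas(64) unsigned char area[512];` passes the check for both 16 and 64, while `area + 8` fails it for 16.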


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.62 04-Jun-2020 riastradh

Call clts/stts in fpu_kern_enter/leave so they work.


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421 bouyer-xenpvh-base1 phil-wifi-20200411 bouyer-xenpvh-base is-mlppp-base phil-wifi-20200406 ad-namecache-base3
# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

branches: 1.60.2;
Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independant of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
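
The XSTATE_BV rule above can be sketched in a few lines of C. This is an illustrative userland model only: the struct and function names are hypothetical and are not the ones used in fpu.c. The bit positions (x87 = bit 0, SSE = bit 1, AVX = bit 2) follow the XCR0 layout.

```c
#include <stdint.h>

/* Component bits, as laid out in XCR0 and XSTATE_BV. */
#define XS_X87	(UINT64_C(1) << 0)
#define XS_SSE	(UINT64_C(1) << 1)
#define XS_AVX	(UINT64_C(1) << 2)

/*
 * Hypothetical stand-in for the 64-byte XSAVE header.
 * Only xstate_bv matters for this sketch.
 */
struct xsave_header_sketch {
	uint64_t xstate_bv;
	uint64_t reserved[7];
};

/*
 * To be called after software edits a component in memory, so that the
 * next XRSTOR loads it from memory instead of using the initial value.
 */
static void
xsave_mark_modified(struct xsave_header_sketch *hdr, uint64_t components)
{
	hdr->xstate_bv |= components;
}
```

Leaving a bit clear is what lets XSAVEOPT skip that component, which is why the x87 bit is not forced when fx_cw already holds the standard value.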

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
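
A minimal sketch of why a static mask loses DAZ, assuming the architectural convention that a zero MXCSR_MASK field in the FXSAVE image means the default mask 0x0000FFBF. The function names are hypothetical; this is not the code in fpu.c.

```c
#include <stdint.h>

#define MXCSR_DAZ		0x00000040u	/* denormals-are-zero, bit 6 */
#define MXCSR_DEFAULT_MASK	0x0000FFBFu	/* default when FXSAVE reports 0 */

/* Use the mask the CPU reported, or the documented default if it is zero. */
static uint32_t
mxcsr_effective_mask(uint32_t fxsave_mask)
{
	return fxsave_mask != 0 ? fxsave_mask : MXCSR_DEFAULT_MASK;
}

/* Drop bits the CPU treats as reserved, so a later restore cannot fault. */
static uint32_t
mxcsr_sanitize(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

With the default mask, bit 6 is cleared and a user-supplied DAZ setting is silently dropped; with a detected mask of 0xFFFF it survives.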


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.61 31-Jan-2020 maxv

'oldlwp' is never NULL now, so remove the NULL checks.


Revision tags: ad-namecache-base2 ad-namecache-base1 ad-namecache-base
# 1.60 27-Nov-2019 maxv

Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-0-RC1 netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


# 1.60 27-Nov-2019 maxv

Add a small API for in-kernel FPU operations.

fpu_kern_enter();
/* do FPU stuff */
fpu_kern_leave();


Revision tags: phil-wifi-20191119
# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
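
A rough user-space model of the variable-sized copy, assuming the architectural sizes of 108 bytes for an FNSAVE image and 512 bytes for an FXSAVE image. The union, the size variable and the function name are hypothetical, not the identifiers used in fpu.c.

```c
#include <stddef.h>
#include <string.h>

#define FNSAVE_SIZE	108	/* 32-bit protected-mode FNSAVE image */
#define FXSAVE_SIZE	512	/* FXSAVE image */

/* The static area is sized for the largest format it can hold. */
union savefpu_model {
	unsigned char sv_87[FNSAVE_SIZE];
	unsigned char sv_xmm[FXSAVE_SIZE];
};

/* Size of the format actually in use; a kernel would set this at boot. */
static size_t fpu_save_size_model = FNSAVE_SIZE;

/* Copy only the bytes the active save format really uses. */
static void
fpu_area_copy_model(union savefpu_model *dst, const union savefpu_model *src)
{
	memcpy(dst, src, fpu_save_size_model);
}
```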


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
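
The XSTATE_BV rule in the list above can be sketched as follows. The bit positions (x87 = bit 0, SSE = bit 1, AVX = bit 2) are architectural; the struct and function names are hypothetical, and the constant names other than XCR0_X87 are assumed for illustration.

```c
#include <stdint.h>

#define XCR0_X87	0x1ULL	/* x87 state, bit 0 */
#define XCR0_SSE	0x2ULL	/* SSE state, bit 1 */
#define XCR0_YMM_Hi128	0x4ULL	/* AVX upper halves, bit 2 */

/* Model of the 64-byte XSAVE header that follows the 512-byte legacy area. */
struct xsave_header_model {
	uint64_t xsh_xstate_bv;	/* components valid in memory */
	uint64_t xsh_xcomp_bv;	/* compaction bitmap, unused here */
	uint64_t xsh_rsvd[6];	/* reserved, must stay zero */
};

/*
 * After software edits a component in memory, flag it in XSTATE_BV so
 * that the next XRSTOR loads it instead of treating it as initial state.
 */
static void
xstate_mark_modified(struct xsave_header_model *hdr, uint64_t components)
{
	hdr->xsh_xstate_bv |= components;
}
```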

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
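
To illustrate why a fixed mask loses DAZ: the architectural fallback mask 0x0000FFBF has bit 6 (DAZ) clear, so masking with it silently drops DAZ even on CPUs that support it. The sketch below is plain arithmetic with hypothetical names; it does not execute FXSAVE.

```c
#include <stdint.h>

#define MXCSR_DAZ		0x00000040u	/* denormals-are-zero, bit 6 */
#define MXCSR_MASK_FALLBACK	0x0000ffbfu	/* used when FXSAVE reports 0 */

/* Pick the mask: the value FXSAVE stored, or the fallback if it was 0. */
static uint32_t
mxcsr_mask_select(uint32_t reported)
{
	return reported != 0 ? reported : MXCSR_MASK_FALLBACK;
}

/* Clear the bits userland must not set, so a later restore cannot fault. */
static uint32_t
mxcsr_sanitize(uint32_t mxcsr, uint32_t mask)
{
	return mxcsr & mask;
}
```

With the fallback mask, a request of 0x1fc0 (the power-on default 0x1f80 plus DAZ) comes back as 0x1f80; with a detected mask of 0xffff it is preserved.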


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
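
The values reported above are consistent with each other: features = 7 means x87, SSE and AVX, and 832 = 512 (legacy area) + 64 (XSAVE header) + 256 (upper halves of 16 YMM registers). A small sketch of that arithmetic, with hypothetical names and assuming the AVX component sits directly after the header:

```c
#include <stddef.h>
#include <stdint.h>

#define XSAVE_LEGACY_SIZE	512	/* x87 + SSE (FXSAVE layout) */
#define XSAVE_HEADER_SIZE	64	/* XSTATE_BV and friends */
#define XSAVE_YMM_HI128_SIZE	256	/* 16 registers x 16 bytes */

#define FEATURE_AVX		0x4ULL	/* bit 2 of the feature mask */

/*
 * Size of a standard-format XSAVE area for a feature mask that contains
 * at most x87, SSE and AVX. Larger components are out of scope here.
 */
static size_t
xsave_area_size_model(uint64_t features)
{
	size_t size = XSAVE_LEGACY_SIZE + XSAVE_HEADER_SIZE;

	if (features & FEATURE_AVX)
		size += XSAVE_YMM_HI128_SIZE;
	return size;
}
```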


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.59 30-Oct-2019 maxv

Style.


# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both requests take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).



# 1.58 12-Oct-2019 maxv

Rewrite the FPU code on x86. This greatly simplifies the logic and removes
the dependency on IPL_HIGH. NVMM is updated accordingly. Posted on
port-amd64 a week ago.

Bump the kernel version to 9.99.16.


# 1.57 04-Oct-2019 maxv

Rename fpu_eagerswitch to fpu_switch, and add fpu_xstate_reload to
simplify.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
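
The size negotiation described above can be sketched in portable C. This is only an illustration of the idea, not the kernel's implementation: demo_state_v1, demo_state_v2 and copy_out_clamped() are hypothetical names, and only struct iovec is a real (POSIX) type.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical: an old and an extended layout of the same structure. */
struct demo_state_v1 {
	unsigned long long bv;
	unsigned char ymm[16];
};

struct demo_state_v2 {
	unsigned long long bv;
	unsigned char ymm[16];
	unsigned char zmm[32];	/* field added in a later version */
};

/*
 * Copy at most iov->iov_len bytes of src into the caller's buffer and
 * return the number of bytes written.  A caller built against the old,
 * smaller layout receives a valid prefix instead of an overrun.
 */
static size_t
copy_out_clamped(const struct iovec *iov, const void *src, size_t src_len)
{
	size_t n = iov->iov_len < src_len ? iov->iov_len : src_len;

	memcpy(iov->iov_base, src, n);
	return n;
}
```

Because the caller states its buffer size explicitly, the producer never has to guess which version of the structure the caller was compiled against.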


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.
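
A minimal sketch of the masking step, assuming the mask has already been read from the MXCSR_MASK field of an FXSAVE image. sanitize_mxcsr() is a hypothetical name, not the function in fpu.c; the 0xffbf fallback is the architectural default that applies when the CPU reports a mask of zero.

```c
#include <stdint.h>

/* Architectural fallback when FXSAVE reports an MXCSR_MASK of zero. */
#define DEFAULT_MXCSR_MASK	0x0000ffbfu

/*
 * Clear every MXCSR bit the CPU does not support, so that a value
 * supplied by userland cannot make a later FXRSTOR/XRSTOR fault.
 */
static uint32_t
sanitize_mxcsr(uint32_t user_mxcsr, uint32_t mxcsr_mask)
{
	if (mxcsr_mask == 0)
		mxcsr_mask = DEFAULT_MXCSR_MASK;
	return user_mxcsr & mxcsr_mask;
}
```

Note that the fallback mask clears bit 6 (DAZ), which is why a hard-coded mask loses features on CPUs that report a wider one.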


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.
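
The fix can be pictured with a small model of the XSAVE header. This is a sketch under stated assumptions: struct demo_xsave_header and demo_mark_filled() are hypothetical stand-ins, while bit 0 (x87) and bit 1 (SSE) are the architectural component numbers.

```c
#include <stdint.h>

/* XSTATE_BV component bits for the two legacy components. */
#define DEMO_XSTATE_X87	(UINT64_C(1) << 0)
#define DEMO_XSTATE_SSE	(UINT64_C(1) << 1)

/* Hypothetical stand-in for the XSAVE header; only xstate_bv is modelled. */
struct demo_xsave_header {
	uint64_t xstate_bv;
};

/*
 * Record that software has just written the given components into the
 * in-memory area.  XRSTOR only loads components whose bit is set, so a
 * zeroed header would otherwise cause the new values to be ignored.
 */
static void
demo_mark_filled(struct demo_xsave_header *hdr, uint64_t components)
{
	hdr->xstate_bv |= components;
}
```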


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.56 03-Oct-2019 maxv

Remove the LazyFPU code, as posted 5 months ago on port-amd64@.


Revision tags: netbsd-9-base
# 1.55 05-Jul-2019 maxv

More inlines, prerequisites for future changes. Also, remove fngetsw(),
which was a duplicate of fnstsw().


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both requests take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
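
The size-bounded transfer described in this entry can be sketched in
user-space C. This is only an illustration, not the kernel code:
'struct xstate_v1', 'struct xstate_v2', the 'xs_extra' field and
copy_bounded() are hypothetical names; only 'struct iovec' is the real
POSIX type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical old and new layouts: v2 appends one field to v1. */
struct xstate_v1 {
	uint64_t xs_rfbm;
	uint64_t xs_xstate_bv;
};

struct xstate_v2 {
	uint64_t xs_rfbm;
	uint64_t xs_xstate_bv;
	uint64_t xs_extra;	/* field added by a later extension */
};

/*
 * Copy at most iov->iov_len bytes of the source structure into the
 * caller's buffer and return the number of bytes copied.
 */
static size_t
copy_bounded(const struct iovec *iov, const void *src, size_t srclen)
{
	size_t n = iov->iov_len < srclen ? iov->iov_len : srclen;

	memcpy(iov->iov_base, src, n);
	return n;
}
```

A caller built against the older, shorter layout passes the shorter
length and receives only the leading fields it knows about, which is
the partial read the entry refers to.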


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
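
A minimal sketch of the idea, assuming a save area sized for FXSAVE
(512 bytes) while the active method is FNSAVE (108 bytes). It covers
only the "state is smaller than the static area" case. 'union savearea'
and savearea_copy() are hypothetical names, not the ones used in fpu.c.

```c
#include <stddef.h>
#include <string.h>

#define FNSAVE_SIZE	108	/* legacy x87 save image */
#define FXSAVE_SIZE	512	/* FXSAVE image */

/* Hypothetical stand-in for the save area embedded in the PCB. */
union savearea {
	unsigned char sv_fnsave[FNSAVE_SIZE];
	unsigned char sv_fxsave[FXSAVE_SIZE];
};

/* Copy only the bytes the active save method actually uses. */
static void
savearea_copy(union savearea *dst, const union savearea *src, size_t used)
{
	if (used > sizeof(*dst))
		used = sizeof(*dst);	/* never overrun the static area */
	memcpy(dst, src, used);
}
```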


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
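
For reference, the alignment condition mentioned above ("rdi % 16 = 8")
can be expressed as a small check. FXSAVE requires a 16-byte aligned
memory operand; is_aligned() is a hypothetical helper written for this
note, not a function from fpu.c.

```c
#include <stdint.h>

/*
 * Return nonzero if p is a multiple of align bytes.
 * align must be nonzero.
 */
static int
is_aligned(const void *p, uintptr_t align)
{
	return ((uintptr_t)p % align) == 0;
}
```

A buffer whose address is 8 modulo 16, as in the report, fails the
16-byte check and would make FXSAVE fault.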


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.
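
The XSTATE_BV rule in the list above can be sketched as follows. The
layout facts are architectural (a 512-byte legacy region followed by a
64-byte header whose first 8 bytes are XSTATE_BV; x87, SSE and AVX are
state components 0, 1 and 2), but 'struct xsave_sketch', the COMP_*
names and xsave_mark_modified() are hypothetical and do not come from
fpu.c.

```c
#include <stdint.h>

#define COMP_X87	0x01ULL	/* state component 0: x87 */
#define COMP_SSE	0x02ULL	/* state component 1: SSE */
#define COMP_AVX	0x04ULL	/* state component 2: AVX (YMM high halves) */

/* Hypothetical minimal view of an XSAVE area: legacy region plus header. */
struct xsave_sketch {
	uint8_t  legacy[512];
	uint64_t xstate_bv;	/* first field of the XSAVE header */
	uint8_t  hdr_rest[56];
};

/*
 * Record that software modified the in-memory image of the given
 * components, so that a later XRSTOR loads them from memory instead of
 * putting them in their initial configuration.
 */
static void
xsave_mark_modified(struct xsave_sketch *xs, uint64_t components)
{
	xs->xstate_bv |= components;
}
```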


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (e.g. DAZ).
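
A rough sketch of why a fixed mask loses DAZ. The architectural default
mask, used when the CPU reports an MXCSR_MASK of zero, is 0xffbf, which
excludes bit 6 (DAZ). The macro and function names below are
hypothetical and are not the ones used in fpu.c.

```c
#include <stdint.h>

#define MXCSR_DAZ		0x0040u	/* denormals-are-zero, bit 6 */
#define MXCSR_MASK_DEFAULT	0xffbfu	/* used when the CPU reports 0 */

/* Pick the mask to apply, given the MXCSR_MASK value read via FXSAVE. */
static uint32_t
mxcsr_effective_mask(uint32_t reported)
{
	return reported != 0 ? reported : MXCSR_MASK_DEFAULT;
}

/* Drop the bits the CPU does not support, so a later restore cannot fault. */
static uint32_t
mxcsr_sanitize(uint32_t mxcsr, uint32_t reported)
{
	return mxcsr & mxcsr_effective_mask(reported);
}
```

With the default mask a user-supplied DAZ bit is silently cleared; with
a detected mask that includes bit 6 it is preserved.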


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.54 26-Jun-2019 mgorny

Implement PT_GETXSTATE and PT_SETXSTATE

Introduce two new ptrace() requests: PT_GETXSTATE and PT_SETXSTATE,
that provide access to the extended (and extensible) set of FPU
registers on amd64 and i386. At the moment, this covers AVX (YMM)
and AVX-512 (ZMM, opmask) registers. It can be easily extended
to cover further register types without breaking backwards
compatibility.

PT_GETXSTATE issues the XSAVE instruction with all kernel-supported
extended components enabled. The data is copied into 'struct xstate'
(which -- unlike the XSAVE area itself -- has stable format
and offsets).

PT_SETXSTATE issues the XRSTOR instruction to restore the register
values from user-provided 'struct xstate'. The function replaces only
the specific XSAVE components that are listed in 'xs_rfbm' field,
making it possible to issue partial updates.

Both syscalls take a 'struct iovec' pointer rather than a direct
argument. This requires the caller to explicitly specify the buffer
size. As a result, existing code will continue to work correctly
when the structure is extended (performing partial reads/updates).
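The size-carrying iovec interface described above can be sketched in userland C. This is only an illustration of why an explicit buffer length keeps old callers working when the structure grows: `struct demo_xstate` and `demo_copy_out()` are hypothetical names, the layout is not the real `struct xstate`, and the exact truncation behaviour of the kernel is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical stand-in for an extensible register structure. */
struct demo_xstate {
	uint64_t xs_rfbm;	/* components the caller asked for */
	uint64_t xs_xstate_bv;	/* components actually present */
	uint8_t  xs_regs[64];	/* placeholder for register contents */
};

/*
 * Copy at most iov->iov_len bytes of the state into the caller's buffer.
 * A caller built against an older, smaller structure passes a smaller
 * iov_len and simply receives a prefix (a partial read).
 * Returns the number of bytes copied.
 */
size_t
demo_copy_out(const struct demo_xstate *src, const struct iovec *iov)
{
	size_t n;

	assert(src != NULL && iov != NULL && iov->iov_base != NULL);
	n = iov->iov_len < sizeof(*src) ? iov->iov_len : sizeof(*src);
	memcpy(iov->iov_base, src, n);
	return n;
}
```

A caller with a 16-byte buffer gets only the first two fields; a caller with a larger buffer gets the whole structure and nothing beyond it.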


Revision tags: phil-wifi-20190609
# 1.53 25-May-2019 maxv

Fix bug. We must fetch the whole FPU state, otherwise XSTATE_BV could be
outdated, and we could be filling the AVX registers with garbage.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
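A minimal sketch of the variable-sized copy idea, assuming a PCB that embeds a fixed maximum-size buffer. The names (`demo_pcb`, `demo_copy_fpu`) are hypothetical and the buffer is capped at the FXSAVE size for brevity; the 108-byte FNSAVE image and 512-byte FXSAVE area are the architectural sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_save_kind { DEMO_FNSAVE, DEMO_FXSAVE };

#define DEMO_AREA_MAX	512	/* largest static area in this sketch */

/* Hypothetical PCB: the FPU buffer is sized for the largest format. */
struct demo_pcb {
	int		other_fields;
	unsigned char	fpu_area[DEMO_AREA_MAX];
};

/* Bytes actually used by the save format in effect. */
size_t
demo_area_size(enum demo_save_kind kind)
{
	return kind == DEMO_FNSAVE ? 108 : 512;
}

/* Copy only the bytes the active save format uses; returns that count. */
size_t
demo_copy_fpu(struct demo_pcb *dst, const struct demo_pcb *src,
    enum demo_save_kind kind)
{
	size_t n = demo_area_size(kind);

	assert(dst != NULL && src != NULL);
	assert(n <= sizeof(dst->fpu_area));
	memcpy(dst->fpu_area, src->fpu_area, n);
	return n;
}
```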


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

branches: 1.43.2;
Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
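The alignment problem mentioned above (rdi % 16 = 8) can be illustrated with a small userland check. FXSAVE requires a 16-byte aligned operand and XSAVE a 64-byte aligned one; the helper names below are hypothetical and this is not the kernel's code.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* True if p is a multiple of align; align must be a power of two. */
bool
demo_is_aligned(const void *p, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t)p & (align - 1)) == 0;
}

/*
 * A buffer with an explicit 64-byte alignment request satisfies both the
 * FXSAVE (16) and XSAVE (64) operand requirements, independent of how the
 * caller's stack happens to be aligned.
 */
unsigned char *
demo_aligned_buffer(void)
{
	static alignas(64) unsigned char buf[512];

	return buf;
}
```

An address such as 0x1008 has remainder 8 modulo 16, which is the failing case reported in the log message.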


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.
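The XSTATE_BV rules listed above can be sketched as follows. This is a simplified model, not the kernel's structures: `demo_area` and the `demo_*` functions are hypothetical, and only the XCR0 bit positions (x87 = bit 0, SSE = bit 1) and the x87 control word after FNINIT (0x037F) are architectural.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_XCR0_X87	UINT64_C(0x1)	/* x87 state component */
#define DEMO_XCR0_SSE	UINT64_C(0x2)	/* SSE state component */
#define DEMO_INIT_CW	0x037F		/* x87 control word after FNINIT */

/* Hypothetical, heavily reduced in-memory save area. */
struct demo_area {
	uint16_t fx_cw;
	uint32_t fx_mxcsr;
	uint64_t xstate_bv;
};

/*
 * Reset the area. The x87 bit is set only when the control word differs
 * from the hardware initial value, so the component can otherwise stay
 * in its "initial" configuration and be skipped by an optimized save.
 */
void
demo_clear(struct demo_area *a, uint16_t cw)
{
	assert(a != NULL);
	memset(a, 0, sizeof(*a));
	a->fx_cw = cw;
	if (cw != DEMO_INIT_CW)
		a->xstate_bv |= DEMO_XCR0_X87;
}

/* Any software change to a component marks it present in XSTATE_BV. */
void
demo_set_mxcsr(struct demo_area *a, uint32_t mxcsr)
{
	assert(a != NULL);
	a->fx_mxcsr = mxcsr;
	a->xstate_bv |= DEMO_XCR0_SSE;
}
```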


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
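A small sketch of why a fixed mask loses DAZ. The architectural rule is that a zero MXCSR_MASK field in the FXSAVE image means the default mask 0x0000FFBF, which excludes DAZ (bit 6). The function names are hypothetical, and this only models the masking arithmetic, not how the kernel obtains the mask.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MXCSR_DAZ		UINT32_C(0x00000040)	/* denormals-are-zero */
#define DEMO_DEFAULT_MASK	UINT32_C(0x0000FFBF)	/* used if CPU reports 0 */

/* Mask to apply, given the MXCSR_MASK value read from an FXSAVE image. */
uint32_t
demo_effective_mask(uint32_t reported)
{
	return reported != 0 ? reported : DEMO_DEFAULT_MASK;
}

/* Drop bits the CPU treats as reserved so a later restore cannot fault. */
uint32_t
demo_sanitize_mxcsr(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```

With the default mask, a user value of 0x1FC0 (reset value 0x1F80 plus DAZ) is reduced to 0x1F80; with a detected mask of 0xFFFF the DAZ bit survives.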


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
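The sysctl values quoted above fit together arithmetically: features = 7 means x87, SSE and AVX (bits 0 to 2), and 832 = 512 (legacy region) + 64 (XSAVE header) + 256 (upper halves of the ymm registers). A sketch of that decoding follows; the names are hypothetical, it only handles components up to AVX, and real code should take sizes and offsets from CPUID rather than hard-coding them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_XCR0_X87	UINT64_C(0x1)
#define DEMO_XCR0_SSE	UINT64_C(0x2)
#define DEMO_XCR0_AVX	UINT64_C(0x4)

enum {
	DEMO_LEGACY_SIZE = 512,	/* FXSAVE-compatible region */
	DEMO_HEADER_SIZE = 64,	/* XSAVE header */
	DEMO_AVX_SIZE	 = 256	/* 16 registers x 16 upper bytes */
};

/* True if every bit of `component` is set in the feature mask. */
bool
demo_has(uint64_t features, uint64_t component)
{
	return (features & component) == component;
}

/* Standard-format area size for masks limited to x87/SSE/AVX. */
uint32_t
demo_xsave_size(uint64_t features)
{
	uint32_t size = DEMO_LEGACY_SIZE + DEMO_HEADER_SIZE;

	assert((features & ~(DEMO_XCR0_X87 | DEMO_XCR0_SSE | DEMO_XCR0_AVX)) == 0);
	if (demo_has(features, DEMO_XCR0_AVX))
		size += DEMO_AVX_SIZE;
	return size;
}
```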


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.52 19-May-2019 maxv

Rename

fpu_save_area_clear -> fpu_clear
fpu_save_area_reset -> fpu_sigreset

Clearer, and reduces a future diff. No real functional change.


# 1.51 19-May-2019 maxv

Misc changes in the x86 FPU code. Reduces a future diff. No real functional
change.


Revision tags: isaki-audio2-base
# 1.50 11-Feb-2019 cherry

We reorganise definitions for XEN source support as follows:

XEN - common sources required for baseline XEN support.
XENPV - sources required for support of XEN in PV mode.
XENPVHVM - sources required for support for XEN in HVM mode.
XENPVH - sources required for support for XEN in PVH mode.


Revision tags: pgoyette-compat-20190127
# 1.49 20-Jan-2019 maxv

Improvements in NVMM

* Handle the FPU differently, limit the states via the given mask rather
than via XCR0. Align to 64 bytes. Provide an initial gXCR0, to be sure
that XCR0_X87 is set. Reset XSTATE_BV when the state is modified by
the virtualizer, to force a reload from memory.

* Hide RDTSCP.

* Zero-extend RBX/RCX/RDX when handling the NVMM CPUID signature.

* Take ECX and not RCX on MSR instructions.


Revision tags: pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126 pgoyette-compat-1020
# 1.48 05-Oct-2018 maxv

export x86_fpu_mxcsr_mask, fpu_area_save and fpu_area_restore


Revision tags: pgoyette-compat-0930
# 1.47 17-Sep-2018 maxv

Reduce the noise, reorder and rename some things for clarity.


Revision tags: pgoyette-compat-0906 pgoyette-compat-0728
# 1.46 01-Jul-2018 maxv

Use a variable-sized memcpy, instead of copying the PCB and then adding
the extra bytes. The PCB embeds the biggest static FPU state, but our
real FPU state may be smaller (FNSAVE), so we don't need to memcpy the
extra unused bytes.
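
The variable-sized copy described in 1.46 can be sketched in user-space C as follows. This is only an illustration: `struct fpu_area`, `FPU_AREA_MAX` and `fpu_area_copy()` are hypothetical names, not the kernel's, and the 512-byte maximum is an assumption chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the largest static FPU save area in the PCB. */
#define FPU_AREA_MAX 512
/* Size of the FNSAVE image in 32-bit protected mode. */
#define FNSAVE_SIZE 108

struct fpu_area {
	unsigned char bytes[FPU_AREA_MAX];
};

/*
 * Copy only the bytes that the active save format really uses.
 * Returns the number of bytes copied.
 */
static size_t
fpu_area_copy(struct fpu_area *dst, const struct fpu_area *src, size_t used)
{
	if (used > sizeof(src->bytes))
		used = sizeof(src->bytes);	/* never read past the area */
	memcpy(dst->bytes, src->bytes, used);
	return used;
}
```

With an FNSAVE-sized state, `fpu_area_copy(&dst, &src, FNSAVE_SIZE)` touches 108 bytes and leaves the rest of the destination alone.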


# 1.45 01-Jul-2018 maxv

Use a switch, we can (and will) optimize each case separately. No
functional change.


# 1.44 29-Jun-2018 maxv

Add more KASSERTs.

Should help PR/53399.


Revision tags: phil-wifi-base pgoyette-compat-0625
# 1.43 23-Jun-2018 maxv

Add XXX in fpuinit_mxcsr_mask.


# 1.42 22-Jun-2018 maxv

Revert jdolecek's changes related to FXSAVE. They just didn't make any
sense and were trying to hide a real bug, which is, that there is for some
reason a wrong stack alignment that causes FXSAVE to fault in
fpuinit_mxcsr_mask. As seen in current-users@ yesterday, rdi % 16 = 8. And
as seen several months ago, as well.

The rest of the changes in XSAVE are wrong too, but I'll let him fix these
ones.
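
The alignment problem mentioned in 1.42 can be illustrated with a small check. FXSAVE requires a 16-byte aligned memory operand and XSAVE a 64-byte aligned one; the helper name `is_aligned()` is hypothetical and the snippet is a user-space sketch, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FXSAVE/FXRSTOR fault unless the memory operand is 16-byte aligned. */
#define FXSAVE_ALIGN 16
/* XSAVE/XRSTOR need 64-byte alignment. */
#define XSAVE_ALIGN 64

/* Return true if addr is a multiple of align (align must be a power of two). */
static bool
is_aligned(uintptr_t addr, size_t align)
{
	return (addr & (align - 1)) == 0;
}
```

An address with `addr % 16 == 8`, like the one reported on current-users@, fails `is_aligned(addr, FXSAVE_ALIGN)`.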


# 1.41 20-Jun-2018 jdolecek

as a stop-gap, make fpuinit_mxcsr_mask() for native independent of
XSAVE as it should be, only xen case checks the flag now; need to
investigate further why exactly the fault happens for the xen
no-xsave case

pointed out by maxv


# 1.40 19-Jun-2018 jdolecek

fix FPU initialization on Xen to allow e.g. AVX when supported by hardware;
only use XSAVE when the CPUID OSXSAVE bit is set, as this seems to be
a reliable indication

tested with Xen 4.2.6 DOM0/DOMU on Intel CPU, without and with no-xsave flag,
so should work also on those AMD CPUs, which have XSAVE disabled by default;
also tested with Xen DOM0 4.8.3

fixes PR kern/50332 by Torbjorn Granlund; sorry it took three years to address

XXX pullup netbsd-8


# 1.39 19-Jun-2018 maxv

When using EagerFPU, create the fpu state in execve at IPL_HIGH.

A preemption could occur in the middle, and we don't want that to happen,
because the context switch would use the partially-constructed fpu state.

The procedure becomes:

splhigh
unbusy the current cpu's fpu
create a new fpu state in memory
install the state on the current cpu's fpu
splx

Disabling preemption also ensures that x86_fpu_eager doesn't change in
the middle.

In LazyFPU mode we drop IPL_HIGH right away.

Add more KASSERTs.


# 1.38 18-Jun-2018 maxv

Add more KASSERTs, see if they help PR/53383.


# 1.37 17-Jun-2018 maxv

No, I meant to put the panic in fpudna not fputrap. Also appease it: panic
only if the fpu already has a state. We're fine with getting a DNA, what
we're not fine with is if the DNA is received while the FPU is busy.

I believe (even though I couldn't trigger it) that the panic would
otherwise fire if PT_SETFPREGS is used. And also ACPI sleep/wakeup,
probably.


# 1.36 16-Jun-2018 maxv

Need IPIs when enabling eager fpu switch, to clear each fpu and get us
started. Otherwise it is possible that the first context switch on one of
the cpus will restore an invalid fpu state in the new lwp, if that lwp
had its fpu state stored on another cpu that didn't have time to do an
fpu save since eager-fpu was enabled.

Use barriers and all the related crap. The point is that we want to
ensure that no context switch occurs between [each fpu is cleared] and
[x86_fpu_eager is set to 'true'].

Also add KASSERTs.


# 1.35 16-Jun-2018 maxv

Actually, don't do anything if we switch to a kernel thread. When the cpu
switches back to a user thread the fpu is restored, so no point calling
fninit (which doesn't clear all the states anyway).


# 1.34 14-Jun-2018 maxv

Install the FPU state on the current CPU in setregs (execve).


# 1.33 14-Jun-2018 maxv

Add some code to support eager fpu switch, INTEL-SA-00145. We restore the
FPU state of the lwp right away during context switches. This guarantees
that when the CPU executes in userland, the FPU doesn't contain secrets.

Maybe we also need to clear the FPU in setregs(), not sure about this one.

Can be enabled/disabled via:

machdep.fpu_eager = {0/1}

Not yet turned on automatically on affected CPUs (Intel Family 6).

More generally it would be good to turn it on automatically when XSAVEOPT
is supported, because in this case there is probably a non-negligible
performance gain; but we need to fix PR/52966.


# 1.32 23-May-2018 maxv

Add a comment about recent AMD CPUs.


# 1.31 23-May-2018 maxv

Clarify and extend the fix for the AMD FPU leaks. We were clearing the x87
state only on FXRSTOR, but the same problem exists on XRSTOR, so clear the
state there too.


# 1.30 23-May-2018 maxv

Merge convert_xmm_s87.c into fpu.c. It contains only two functions, that
are used only in fpu.c.


# 1.29 23-May-2018 maxv

style


Revision tags: pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base
# 1.28 09-Feb-2018 maxv

branches: 1.28.2;
Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.
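
The XSTATE_BV rule in the 1.24 list (software that edits the in-memory state sets XSTATE_BV[i]=1) can be sketched as below. The bit positions for x87, SSE and AVX are architectural; `struct xsave_header_sketch` and `xstate_mark_modified()` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* State-component bits shared by XCR0 and XSTATE_BV. */
#define XSTATE_X87 (UINT64_C(1) << 0)
#define XSTATE_SSE (UINT64_C(1) << 1)
#define XSTATE_AVX (UINT64_C(1) << 2)

/* Hypothetical, simplified view of the 64-byte XSAVE header. */
struct xsave_header_sketch {
	uint64_t xstate_bv;	/* which components are valid in memory */
	uint64_t xcomp_bv;
	uint64_t reserved[6];
};

/*
 * Call after software edits a component in the in-memory image, so that
 * the next XRSTOR loads it instead of treating it as being in its
 * initial state.
 */
static void
xstate_mark_modified(struct xsave_header_sketch *hdr, uint64_t component)
{
	hdr->xstate_bv |= component;
}
```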


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).
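
Why a static mask loses DAZ can be shown with a short sketch. It assumes the static mask was the architectural default 0xFFBF, which is what applies when FXSAVE reports an MXCSR_MASK of zero; `mxcsr_mask_select()` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Architectural default used when FXSAVE reports an MXCSR_MASK of zero. */
#define MXCSR_MASK_DEFAULT UINT32_C(0x0000FFBF)
/* Denormals-Are-Zeros control bit in MXCSR (bit 6). */
#define MXCSR_DAZ UINT32_C(0x00000040)

/*
 * Pick the mask to apply to MXCSR values: the one the CPU wrote into the
 * FXSAVE image if it is non-zero, otherwise the architectural default.
 */
static uint32_t
mxcsr_mask_select(uint32_t fxsave_mask)
{
	return fxsave_mask != 0 ? fxsave_mask : MXCSR_MASK_DEFAULT;
}
```

The default mask has bit 6 clear, so applying it unconditionally strips DAZ even on CPUs that report the bit as supported.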


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.
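
The masking step from 1.19 amounts to a single AND before the value reaches the save area. A minimal sketch, with the hypothetical name `mxcsr_sanitize()` and the mask passed in as a parameter rather than read from any kernel variable:

```c
#include <stdint.h>

/*
 * Clear every MXCSR bit that the mask does not allow, so that a value
 * supplied by userland cannot carry reserved bits into XRSTOR.
 */
static uint32_t
mxcsr_sanitize(uint32_t user_mxcsr, uint32_t mask)
{
	return user_mxcsr & mask;
}
```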


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
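
The reported values fit together arithmetically: a feature mask of 7 means x87, SSE and AVX, and in the standard XSAVE layout that gives 512 (legacy region) + 64 (header) + 256 (AVX upper halves) = 832 bytes. The sketch below models only those three components; `xsave_area_size()` and the `XFEAT_*` names are hypothetical.

```c
#include <stdint.h>

#define XSAVE_LEGACY_SIZE 512	/* x87 and SSE state, FXSAVE layout */
#define XSAVE_HEADER_SIZE 64	/* XSTATE_BV, XCOMP_BV, reserved */
#define XSAVE_AVX_SIZE    256	/* upper 128 bits of 16 ymm registers */

#define XFEAT_X87 (UINT32_C(1) << 0)
#define XFEAT_SSE (UINT32_C(1) << 1)
#define XFEAT_AVX (UINT32_C(1) << 2)

/*
 * Size of a standard-format XSAVE area for the given feature mask.
 * Only x87, SSE and AVX are modeled here.
 */
static uint32_t
xsave_area_size(uint32_t features)
{
	uint32_t size = XSAVE_LEGACY_SIZE + XSAVE_HEADER_SIZE;

	if (features & XFEAT_AVX)
		size += XSAVE_AVX_SIZE;
	return size;
}
```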


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.28 09-Feb-2018 maxv

Force a reload of CW in fpu_set_default_cw(). This function is used only
in COMPAT_FREEBSD, it really needs to die.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28,2 seconds
to 20,8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present. no FPU x86 is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.
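
The sysctl values quoted in 1.7 can be decoded by hand. The sketch below assumes the standard (non-compacted) XSAVE layout and only the three components x87, SSE and AVX. Real code would query CPUID leaf 0xD rather than hard-code sizes, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XCR0_X87	(1ULL << 0)	/* x87 state, lives in the legacy region */
#define XCR0_SSE	(1ULL << 1)	/* XMM state, also in the legacy region */
#define XCR0_AVX	(1ULL << 2)	/* upper halves of YMM0..YMM15 */

/* Hypothetical helper: save-area size for a feature mask of x87/SSE/AVX. */
static unsigned
xsave_size_for(uint64_t features)
{
	unsigned size = 512 + 64;	/* legacy region + XSAVE header */

	/* This sketch knows nothing about components beyond AVX. */
	assert((features & ~(XCR0_X87 | XCR0_SSE | XCR0_AVX)) == 0);
	if (features & XCR0_AVX)
		size += 16 * 16;	/* 16 registers x 16 bytes each */
	return size;
}
```

So machdep.xsave_features = 7 is x87 | SSE | AVX, and 512 + 64 + 256 gives the reported machdep.xsave_size of 832.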


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


Revision tags: tls-maxphys-base-20171202
# 1.27 11-Nov-2017 maxv

Recommit

http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html

but use __INITIAL_MXCSR_MASK__ on Xen until someone figures out what's
wrong with the Xen fpu.


# 1.26 11-Nov-2017 bouyer

Revert http://mail-index.netbsd.org/source-changes/2017/11/08/msg089525.html,
it breaks Xen:
http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201711082340Z_anita.txt


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.
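
The rules above can be illustrated with a deliberately simplified bitmap model. It is a toy, not a description of the hardware or of fpu.c: it ignores the "modified" optimization and the special handling of MXCSR, and both function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of XRSTOR: for every requested component (rfbm), XINUSE
 * takes the value of the matching XSTATE_BV bit; components that were
 * not requested keep their old tracking state.
 */
static uint64_t
model_xrstor_xinuse(uint64_t old_xinuse, uint64_t rfbm, uint64_t xstate_bv)
{
	return (old_xinuse & ~rfbm) | (xstate_bv & rfbm);
}

/*
 * Toy model of the XSAVEOPT "init" optimization: a component is written
 * to memory only if it is both requested and tracked as in use.
 */
static uint64_t
model_xsaveopt_written(uint64_t rfbm, uint64_t xinuse)
{
	return rfbm & xinuse;
}
```

In this model, restoring with XSTATE_BV = 2 (SSE only) under a request mask of 7 leaves the x87 bit clear in XINUSE, so the following save skips the x87 component. Setting XSTATE_BV bit 0 before the restore makes it part of the next save again.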

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present; x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

branches: 1.12.8;
Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.8; 1.9.10; 1.9.12; 1.9.16;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.25 08-Nov-2017 maxv

Call fpuinit_mxcsr_mask in cpu_init, after cr4 is initialized, but before
touching xcr0. Then use clts/stts instead of modifying cr0, and enable the
mxcsr_mask detection on Xen.


# 1.24 04-Nov-2017 maxv

Add support for xsaveopt. It is basically an instruction that optimizes
context switch performance by not saving to memory FPU registers that are
known to be in their initial state or known not to have changed since the
last time they were saved to memory.

Our code is now compatible with the internal state tracking engine:
- We don't modify the in-memory FPU state after doing an XSAVE/XSAVEOPT.
That is to say, we always call XRSTOR first.
- During a fork, the whole in-memory FPU state area is memcopied in the
new PCB, and CR0_TS is set. Next time the forked thread uses the FPU it
will fault, we migrate the area, call XRSTOR and clear CR0_TS. During
this XRSTOR XSTATE_BV still contains the initial values, and it forces
a reload of XINUSE.
- Whenever software wants to change the in-memory FPU state, it manually
sets XSTATE_BV[i]=1, which forces XINUSE[i]=1.
- The address of the state passed to xrstor is always the same for a
given LWP.

fpu_save_area_clear is changed not to force a reload of CW if fx_cw is
the standard FPU value. This way we have XINUSE[i]=0 for x87, and xsaveopt
will optimize this state.

Small benchmark:
switch lwp to cpu2
do float operation
switch lwp to cpu3
do float operation
Doing this 10^6 times in a loop, my cpu goes on average from 28.2 seconds
to 20.8 seconds.


# 1.23 04-Nov-2017 maxv

Always set XCR0_X87, to force a reload of CW. That's needed for compat
options where fx_cw is not the standard fpu value.


# 1.22 04-Nov-2017 maxv

Fix xen. Not tested, but seems fine enough.


# 1.21 03-Nov-2017 maxv

Fix MXCSR_MASK, it needs to be detected dynamically, otherwise when masking
MXCSR we are losing some features (eg DAZ).


# 1.20 31-Oct-2017 maxv

Zero out the buffer entirely.


# 1.19 31-Oct-2017 maxv

Mask mxcsr, otherwise userland could set reserved bits to 1 and make
xrstor fault.


# 1.18 31-Oct-2017 maxv

Initialize xstate_bv with the structures that were just filled in,
otherwise xrstor does not restore them. This can happen only if userland
calls setcontext without having used the FPU before.

Until rev1.15 xstate_bv was implicitly initialized because the xsave area
was not zeroed out properly.


# 1.17 31-Oct-2017 maxv

Don't embed our own values in the reserved fields of the XSAVE area, it
really is a bad idea. Move them into the PCB.


# 1.16 31-Oct-2017 maxv

Always use x86_fpu_save, clearer.


# 1.15 31-Oct-2017 maxv

Remove comments that are more misleading than anything else. While here
make sure we zero out the FPU area entirely, and not just its legacy
region.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present; x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: matt-nb8-mediatek-base nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.10;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


# 1.14 09-Oct-2017 maya

GC i386_fpu_present; x86 without an FPU is not supported.

Also delete newly unused send_sigill


# 1.13 17-Sep-2017 maxv

Remove the second argument from USERMODE and KERNELMODE, it is unused
now that we don't have vm86 anymore.


Revision tags: nick-nhusb-base-20170825 perseant-stdc-iso10646-base netbsd-8-base prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base pgoyette-localcount-20170320 nick-nhusb-base-20170204 bouyer-socketcan-base pgoyette-localcount-20170107 nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.10;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104 nick-nhusb-base-20161004
# 1.12 29-Sep-2016 maxv

Remove outdated comments, typos, rename and reorder a few things.


Revision tags: localcount-20160914
# 1.11 18-Aug-2016 maxv

Simplify.


Revision tags: pgoyette-localcount-20160806 pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319 nick-nhusb-base-20151226 nick-nhusb-base-20150921 nick-nhusb-base-20150606 nick-nhusb-base-20150406 nick-nhusb-base
# 1.10 27-Nov-2014 uebayasi

branches: 1.10.2; 1.10.4;
Consistently use kpreempt_*() outside scheduler path.


Revision tags: netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 tls-maxphys-base netbsd-7-base rmind-smpnet-base rmind-smpnet-nbase yamt-pagecache-base9 tls-earlyentropy-base riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3
# 1.9 25-Feb-2014 dsl

branches: 1.9.4; 1.9.6; 1.9.10;
Add support for saving the AVX-256 ymm registers during FPU context switches.
Add support for the forthcoming AVX-512 registers.
Code compiled with -mavx seems to work, but I've not tested context
switches with live ymm registers.
There is a small cost on fork/exec (a larger area is copied/zeroed),
but I don't think the ymm registers are read/written unless they
have been used.
The code uses XSAVE on all cpus, I'm not brave enough to enable XSAVEOPT.


# 1.8 23-Feb-2014 dsl

Add fpu_set_default_cw() and use it in the emulations to set the default
x87 control word.
This means that nothing outside fpu.c cares about the internals of the
fpu save area.
New kernel modules won't load with the old kernel - but that won't matter.


# 1.7 23-Feb-2014 dsl

Determine whether the cpu supports xsave (and hence AVX).
The result is only written to sysctl nodes at the moment.
I see:
machdep.fpu_save = 3 (implies xsaveopt)
machdep.xsave_size = 832
machdep.xsave_features = 7
Completely common up the i386 and amd64 machdep sysctl creation.


# 1.6 15-Feb-2014 dsl

Load and save the fpu registers (for copies to/from userspace) using
helper functions in arch/x86/x86/fpu.c
They (hopefully) ensure that we write to the entire buffer and don't load
values that might cause faults in kernel.
Also zero out the 'pad' field of the i386 mcontext fp area that I think
once contained the registers of any Weitek fpu.
Dunno why it wasn't part of the union.
Some of these copies could be removed if the code directly copied the save
area to/from userspace addresses.


# 1.5 15-Feb-2014 dsl

Remove all references to MDL_USEDFPU and deferred fpu initialisation.
The cost of zeroing the save area on exec is minimal.
This stops the FP registers of a random process being used the first
time an lwp uses the fpu.
sendsig_siginfo() and get_mcontext() now unconditionally copy the FP
registers.
I'll remove the double-copy for signal handlers soon.
get_mcontext() might have been leaking kernel memory to userspace - and
may still do so if i386_use_fxsave is false (short copies).


# 1.4 13-Feb-2014 dsl

Check the argument types for the fpu asm functions.


# 1.3 12-Feb-2014 dsl

Change i386 to use x86/fpu.c instead of i386/isa/npx.c
This changes the trap10 and trap13 code to call directly into fpu.c,
removing all the code for T_ARITHTRAP, T_XMM and T_FPUNDA from i386/trap.c
Not all of the code that appeared to handle fpu traps was ever called!
Most of the changes just replace the include of machine/npx.h with x86/fpu.h
(or remove it entirely).


# 1.2 12-Feb-2014 dsl

Change the argument to fpudna() to be the trapframe.
Move the checks for fpu traps in kernel into x86/fpu.c.
Remove the code from amd64/trap.c related to fpu traps (they've not gone
there for ages - expect to panic in kernel mode).
In fpudna():
- Don't actually enable hardware interrupts unless we need to
allow in IPIs.
- There is no point in enabling them when they are blocked in software
(by splhigh()).
- Keep the splhigh() to avoid a load of the KASSERTS() firing.


# 1.1 11-Feb-2014 dsl

Move sys/arch/amd64/amd64/fpu.c and sys/arch/amd64/include/fpu.h
into sys/arch/x86 in preparation for using the same code for i386.