Cross Reference: /freebsd-9.3-release/sys/ia64/ia64/syscall.S

History log of /freebsd-9.3-release/sys/ia64/ia64/syscall.S
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
# 267654	19-Jun-2014	gjb	Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-9.3-release
# 247018	20-Feb-2013	marcel	MFC r246890: Close a race relating to setting the PCPU pointer (r13).
# 225736	22-Sep-2011	kensmith	Copy head to stable/9 as part of 9.0-RELEASE release cycle. Approved by: re (implicit)
# 221894	14-May-2011	marcel	Prefer switching the memory stack from user to kernel before switching the register stack. While the ordering doesn't matter, it creates an invariant not previously there: the memory stack pointer will always be larger than the register stack pointer. With this invariant in place, it's easier to add instrumentation code that detects a stack overflow because in such a scenario the memory stack pointer and register stack pointers have crossed each other. Aside: basic kernel operation needs about half the stack size (~16K) at most. We have plenty of head room on the kernel stack...
# 204184	21-Feb-2010	marcel	Prefer I-units and M-units for nop instructions. This works around McKinley flaws. It also avoids using the F-unit in the kernel for no reason.
# 171463	16-Jul-2007	marcel	Restore the value of ar.rnat after the assignment to ar.bspstore. The SDM states that writing to ar.bspstore invalidates the ar.rnat register as a side-effect. This was interpreted as "bits in the ar.rnat register that correspond to registers whose value is on the stack are undefined'. Since we keep the kernel stack NaT- aligned with the user stack (i.e. the lower 9 bits of the backing store pointer remain unchanged when we switch to the kernel stack) bits that need preserving would be preserved. That interpretation is questionable. So, now, the interpretation is more absolute: ar.rnat is undefined after writing to ar.bspstore. As such, we write the saved value of ar.rnat back to ar.rnat after writing to ar.bspstore. Discussed with: christian.kandeler@hob.de Approved by: re (kensmith)
# 139790	06-Jan-2005	imp	/* -> /*- for copyright notices, minor format tweaks as necessary
# 134502	29-Aug-2004	marcel	s/ENTRY/ENTRY_NOPROFILE/g for particular functions that do not follow the C calling convention or are otherwise not regular functions. This allows us to boot a profiling kernel.
# 133286	07-Aug-2004	marcel	Slightly move labels around to make sure we call ast() on our way out after a fork(2) in fork_trampoline(). By moving the epc_syscall_return label immediately before the call to do_ast() in epc_syscall(), we not only achieve that but also handle the detour through exception_return when the frame corresponds to an asynchronous kernel entry. Hence, we simplified fork_trampoline() as a side-effect.
# 122479	11-Nov-2003	marcel	Fix a nasty bug that got exposed when the sendsig() and sigreturn() functions switched to using {g\|s}et_mcontext(). The problem is that sigreturn(), being a syscall, can be given an async. context (i.e. one corresponding to an interrupt or trap). When this happens, we try to return to user mode via epc_syscall_return with a trapframe that can only be used to return to user mode via exception_restore. To fix this, we check the frame's flags immediately prior to epc_syscall_return and branch to exception_restore for non-syscall frames. Modify the assertion in set_mcontext() to check that if there's a mismatch, it's because of sigreturn().
# 122368	09-Nov-2003	marcel	Use get_mcontext() to construct the signal context in sendsig() and use set_mcontext() to restore the context in sigreturn(). Since we put the syscall number and the syscall arguments in the trapframe (we don't save the scratch registers for syscalls, which allows us to reuse the space to our advantage), create a MD specific flag so that we save the scratch registers even for syscalls. We would not be able to restart a syscall otherwise. The signal trampoline does not need to flush the regiters anymore, because get_mcontext() already handles that. In fact, if we set up the context correctly, we do not need to have a trampoline at all. This change however only minimally changes the trampoline code. In follow-up commits this can be further optimized. Note that normally we preserve cfm and iip in the trapframe created by the EPC syscall path when we restore a context in set_mcontext() because those fields are not normally set for a synchronuous context. The kernel puts the return address and frame info of the syscall stub in there. By preserving these fields we hide this detail from userland which allows us to use setcontext(2) for user created contexts. However, sigreturn() is commonly called from the trampoline, which means that if we preserve cfm and iip in all cases, we would return to the trampoline after the sigreturn(), which means we hit the safety net: we call exit(2). So, we do not preserve cfm and iip when we have a synchronous context that also has scratch registers (the uncommon context created by sendsig() only), under the assumption that if such a context is created in userland, something special is going on and the use of cfm and iip is then just another quirk. All this is invisible in the common case.
# 121635	28-Oct-2003	marcel	When switching the RSE to use the kernel stack as backing store, keep the RNAT bit index constant. The net effect of this is that there's no discontinuity WRT NaT collections which greatly simplifies certain operations. The cost of this is that there can be up to 504 bytes of unused stack between the true base of the kernel stack and the start of the RSE backing store. The cost of adjusting the backing store pointer to keep the RNAT bit index constant, for each kernel entry, is negligible. The primary reasons for this change are: 1. Asynchronuous contexts in KSE processes have the disadvantage of having to copy the dirty registers from the kernel stack onto the user stack. The implementation we had so far copied the registers one at a time without calculating NaT collection values. A process that used speculation would not work. Now that the RNAT bit index is constant, we can block-copy the registers from the kernel stack to the user stack without having to worry about NaT collections. They will be in the right place on the user stack. 2. The ndirty field in the trapframe is now also usable in userland. This was previously not the case because ndirty also includes the space occupied by NaT collections. The value could be off by 8, depending on the discontinuity. Now that the RNAT bit index is contants, we have exactly the same number of NaT collection points on the kernel stack as we would have had on the user stack if we didn't switch backing stores. 3. Debuggers and other applications that use ptrace(2) can now copy the dirty registers from the kernel stack (using ptrace(2)) and copy them whereever they want them (onto the user stack of the inferior as might be the case for gdb) without having to worry about NaT collections in the same way the kernel doesn't have to worry about them. There's a second order effect caused by the randomization of the base of the backing store, for it depends on the number of dirty registers the processor happened to have at the time of entry into the kernel. The second order effect is that the RSE will have a better cache utilization as compared to having the backing store always aligned at page boundaries. This has not been measured and may be in practice only minimally beneficial, if at all measurable.
# 120683	03-Oct-2003	marcel	Swap the syscall caller frame info (i.e. the return pointer and frame marker) and the syscall stub frame info in the trap frame. Previously we stored the stub frame info in (rp,pfs) and the caller frame info in (iip,cfm). This ends up being suboptimal for the following reasons: 1. When we create a new context, such as for an execve(2), we had to set the (rp,pfs) pair for the entry point when using the syscall path out of the kernel but we need to set the (iip,cfm) pair when we take the interrupt way out. This is mostly just an inconsistency from the kernel's point of view, but an ugly irregularity from gdb(1)'s point of view. 2. The getcontext(2) and setcontext(2) syscalls had to swap the (rp,pfs) and (iip,cfm) pairs to make the context compatible with one created purely in userland. Swapping the (rp,pfs) and (iip,cfm) pairs is visible to signal handlers that actually peek at the mcontext_t and to gdb(1). Since this change is made for gdb(1) and we don't care about signal handlers that peek at the mcontext_t because we're still a tier 2 platform, this ABI breakage is academic at this moment in time. Note that there was no real reason to save the caller frame info in (iip,cfm) and the stub frame info in (rp,pfs).
# 118851	13-Aug-2003	marcel	Put an instruction group break between the move to ar.rnat and the move to ar.rsc. The RSE must be in enforced lazy mode when writing to RSE modifyable registers. In this case we restore the RSE NaT collection register ar.rnat. I have seen 2 general exception faults on pluto1 now that indicate that the move to ar.rsc has already happened prior to the move to ar.rnat, meaning that the RSE is not in enforced lazy mode anymore. The ia64 dependency and instruction ordering rules seem to allow having both registers written to in the same instruction group, provided ar.rsc is written to later than ar.rnat (based on the ordering semantics). It appears that we may be pushing our luck. For now, put them in seperate cycles (by means of the instruction group break). If we ever get a general exception fault on the move to ar.rnat again, we have definite proof that something else is fishy.
# 118563	06-Aug-2003	marcel	o In revision 1.45 of exception.S we changed exception_restore to unconditionally restore ar.k7 (kernel memory stack) and ar.k6 (kernel register stack). I don't know what I was smoking then, but if you unconditionally restore ar.k6, you also want to compute its value unconditionally. By having the computation predicated and dependent on whether we return to user mode, we would end up writing junk (= invalid value for ar.bspstore) if we would return to kernel mode. But the whole point of the unconditional restoration was that there is a grey area where we still need to have ar.k6 restored. If we restore with a junk value, we would end up wedging the machine on the next interrupt. So, unconditionally calculate the value we unconditionally write to ar.k6. o The previous braino was found while making the following change: We used to clear the lower 9 bits of the value we write to ar.k6. The meaning being that we know that the kernel register stack is at least 512 byte aligned and simply clearing the lower 9 bits allows us to return to a context of which we don't have dirty registers on the kernel stack, even though the context that entered the kernel does have dirty registers on the kernel stack. By masking-off the lower bits, we correctly obtain the base of the register stack without having to worry that we didn't actually reached the base while unwinding it. The change is to mask off the lower 13 bits, knowing that the kernel register stack is always 8KB aligned. The advantage is that we don't have to worry anymore if there's more than 512 bytes of dirty registers on the kernel stack. A situation that frequently occurs. In exec_setregs() in machdep.c:1.147 or older, we had to deal with that situation by copying the active portion of the register stack down in multiples of 512 bytes. Now that we mask off the lower 13 bits we don't have to do that at all. Contemporary IPF processors have a register file that can hold up to 96 stacked registers (=784 bytes [incl. 2 NaT collections]). With no indication that register files grow beyond a couple of hundred registers, we should not have to worry about it anymore... and yes, 640KB is enough for everybody :-) This change helps setcontext(2) and cpu_set_upcall_kse() in that they can return to completely different contexts without having to mess with the kernel stack. Of course exec_setregs() doesn't need to do that anymore as well.
# 117437	11-Jul-2003	marcel	Add a body directive before the first instruction in epc_syscall(). This results in a zero length prologue and a body that covers the whole function. This is more correct.
# 117161	02-Jul-2003	ru	The .s files were repo-copied to .S files. Approved by: marcel Repocopied by: joe
# 115563	31-May-2003	marcel	Some ia32 related finetuning for the EPC syscall path: o The SDM states that flushing the RSE in the cycle prior to the call to ia32 code yields the best performance. We don't really care to much about performance here, but we do the same anyway. I'm being paranoia and conservative here. o Only initialize the ia32 state registers, not the registers used as scratch by the ia32 engine. This saves a couple of loads from the trapframe, but also helps debugging: we don't clobber useful debugging data (engineering hints :-) o Make sure all general registers constituting ia32 state have been initialized. If there's no useful to be loaded from the trapframe, clear the register. This avoids accidentally leaking NaT bits. o Make sure we set ar.k6 prior to clobbering ar.bspstore and also set ar.k7 prior to setting sp. This fixes a race seen for ia64 native code as well (and previously fixed too).
# 115296	24-May-2003	marcel	Fix a source of instability specific to an EPC userland. We return to userland with interrupts disabled until we restore PSR. However, it has been observed that interrupts do actually happen before they are enabled again. This is a bit surprising and I don't know yet what's going on exactly. Nevertheless, the code was not crafted carefully enough to allow interrupts to happen and we could clobber the kernel stack of another thread when interrupts did happen. This is what happens: we restore the (memory) stack pointer (sp) and the register stack base prior to restoring ar.k6 and ar.k7. This is not a problem if interrupts don't happen between setting sp/ar.bspstore and ar.k6/ar.k7. Alas, interrupts can happen. Since sp/ar.bspstore already point to the userland stacks, we need to switch to the kernel stack in interrupt. However, ar.k6 and ar.k7 have not been set, which means that we were switching to some unrelated kstack and happily clobbered the trapframe present there if the thread to which the kstack belonged was in kernel mode or otherwise we could have our trapframe clobbered if that other thread enters the kernel. Nasty either way. We now carefully restore ar.k6 prior to restoring ar.bspstore and likewise for ar.k7 and sp. All we need is the guarantee that an interrupt does not clobber ar.k6 or ar.k7 before we're back in userland. That has been achieved by restoring ar.k6/ar.k7 unconditionally (see exception.s) While here, remove the disabling of interrupts on EPC entry. It was added as a way to "resolve" the crashes until it was understood what was going on. I think I achieved the latter, so we can remove the patch. Note that setting up a trapframe with interrupts enabled has it's own share of corner cases, but it's better to properly fixed those than to keep a mostly wrong patch around because we're afraid to remove it... Approved by: re@ (blanket)
# 115017	15-May-2003	marcel	This file contains the code that implements the syscall path based on the epc instruction. The epc instruction, given the permissions of the page in which the epc is located, allows the privilege level to be increased with little or no overhead. The previous privilege level is recorded in the current frame marker and is restored by a regular (function) return. Since the epc instruction has to live in a page with non-standard properties, we hardwire a "gateway" page in the address space. The address of the gateway page is exported to userland in ar.k7. This allows us to rewire the page without breaking the ABI. The syscall stubs in libc are regular function calls that slightly differ from the normal runtime. The difference is mostly to simplify the stubs themselves by by moving some of the logic to the kernel. The libc stubs call into the gateway page (offset 0), from where the kernel trampolines to the code that sets up a minimal trapframe and arranges to execute from the kernel stack. The way back is basicly the same. The kernel returns to the gateway page, whereby privilege is dropped, and jumps back to the syscall stub. Only the special registers are saved in the trapframe. None of the scratch registers are preserved and since the kernel follows the same runtime model, none of the preserved registers are saved. Future enhancements can include the implementation of lightweight syscalls, where kernel functions are performed without setting up a trapframe. Good candidates are the *context syscalls for example. Now that there's a gateway page from which code can be executed in a non-privileged context, we also have the ideal place to put the signal trampolines. By moving the signal trampolines from the user stack to the gateway page, we open up the doors to unexecutable stacks. The gateway page contains signal trampolines for both the "legacy" break-based syscall code and the new and improved epc- based syscall code. Approved: re@ (blanket)