sun4u/cpu/opl_olympus_copy.s

5  * Common Development and Distribution License (the "License").
11  * and limitations under the License.
14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
48  * 	! of copy and flags.  Set up error handling accordingly.
49  *	! The transition point depends on whether the src and
54  *	! For FP version, %l6 holds previous error handling and
57  *	! So either %l6 or %o4 is reserved and not available for
106  *			restore error handler and exit.
115  *			restore error handler and exit.
119  * ! method: line up src and dst as best possible, then
206  *	We've tried to restore fp state from the stack and failed.  To
216  * saving a register save and restore.  Also, less elaborate setup
218  * For longer copies, especially unaligned ones (where the src and
225  * moved whether the FP registers need to be saved, and some other
227  * 400 clocks.  Since each non-repeated/predicted tst and branch costs
229  * longer copies and only benefit a small portion of medium sized
244  * is more data and that data is not in cache, failing to prefetch
247  * The exact tradeoff is strongly load and application dependent, with
258  * hw_copy_limit_1 = src and dst are byte aligned but not halfword aligned
259  * hw_copy_limit_2 = src and dst are halfword aligned but not word aligned
260  * hw_copy_limit_4 = src and dst are word aligned but not longword aligned
261  * hw_copy_limit_8 = src and dst are longword aligned
263  * To say that src and dst are word aligned means that after
265  * both the src and dst will be on word boundaries so that
266  * word loads and stores may be used.
277  * If hw_copy_limit_? is set to a value between 1 and VIS_COPY_THRESHOLD (256)
280  * It is provided to allow for disabling FPBLK copies and to allow
287  * saves an alignment test, memory reference, and enabling test
291  * non-predicted tst and branch costs around 10 clocks.
292  * If src and dst are randomly selected addresses,
297  * But, tests on running kernels show that src and dst to copy code
298  * are typically not on random alignments.  Structure copies and
309  * We subdivide the non-FPBLK case further into CHKSIZE bytes and less
311  * align src and dst.  We try to minimize special case tests in
316  * src and dst alignment and provide special cases for each of
318  * to decide between short and medium size was chosen to be 39
320  * shift and 4 times 8 bytes for the first long word unrolling.
331  * and nops which are not executed in the code.  This
336  * instruction and the unrolled loops, then the alignment needs
341  * a non-predicted tst and branch takes 10 clocks, this savings
353  * three iterations later and shows a measured improvement
362  * Notes on preserving existing fp state and on membars.
366  * preserve - the rest of the kernel does not use fp and, anyway, fp
369  *	- userland has fp state and is interrupted (device interrupt
370  *	  or trap) and within the interrupt/trap handling we use
375  *	  userland or in kernel copy) and the tl0 component of the handling
377  *	- a user process with fp state incurs a copy-on-write fault and
381  * using our stack is ideal (and since fp copy cannot be leaf optimized
394  * ourselves and it is our cpu which will take any trap.
404  * and reboot the system (or restart the service with Greenline/Contracts).
408  * the event and the trap PC may not be the PC of the faulting access.
414  * is no need to repeat this), and we must force delivery of deferred
420  * Since the copy operations may preserve and later restore floating
425  * To make sure that floating point state is always saved and restored
431  *    use.  Bit 2 (TRAMP_FLAG) indicates that the call was to bcopy, and a
471  * Entry points bcopy, copyin_noerr, and copyout_noerr use this flag.
472  * kcopy, copyout, xcopyout, copyin, and xcopyin do not set this flag.
490  * floating-point register save area and 2 64-bit temp locations.
524  * Copy functions use either quadrants 1 and 3 or 2 and 4.
526  * FZEROQ1Q3: Zero quadrants 1 and 3, ie %f0 - %f15 and %f32 - %f47
527  * FZEROQ2Q4: Zero quadrants 2 and 4, ie %f16 - %f31 and %f48 - %f63
569  * Macros to save and restore quadrants 1 and 3 or 2 and 4 to/from the stack.
570  * Used to save and restore in-use fp registers when we want to use FP
571  * and find fp already in use and copy size still large enough to justify
572  * the additional overhead of this save and restore.
583  * original data, and a membar #Sync after restore lets the block loads
587  * and before using the BLD_*_FROMSTACK macro.
593 	and	tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */	;\
602 	and	tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */	;\
611 	and	tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */	;\
620 	and	tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */	;\
628  * FP_NOMIGRATE and FP_ALLOWMIGRATE.  Prevent migration (or, stronger,
630  * switch) before commencing a FP copy, and reallow it on completion or
782  * Errno value is in %g1.  bcopy_more uses fp quadrants 1 and 3.
790 	  and	%l6, TRAMP_FLAG, %l0		! copy trampoline flag to %l0
811 	! and bcopy. kcopy will *always* set a t_lofault handler
813 	! and *not* to invoke any existing error handler. As far as
880  * Assumes double word alignment and a count >= 256.
1089 !  Now long word aligned and have at least 32 bytes to move
1139 !  Now word aligned and have at least 36 bytes to move
1186 !  Now half word aligned and have at least 38 bytes to move
1216  * profiling and dtrace of the portions of the copy code that uses
1237 	! kcopy and bcopy use the same code path. If TRAMP_FLAG is set
1238 	! and the saved lofault was zero, we won't reset lofault on
1465 	  subcc	%o0, %o1, %o3		! difference of from and to address
1472 2:	cmp	%o2, %o3		! cmp size and abs(from - to)
1475 	  cmp	%o0, %o1		! compare from and to addresses
1512  * has already disabled kernel preemption and has checked
1642  * Transfer data to and from user space -
1647  * Note that copyin(9F) and copyout(9F) are part of the
1652  * So there's two extremely similar routines - xcopyin() and xcopyout()
1658  * There are also stub routines for xcopyout_little and xcopyin_little,
1669  * The only difference between copy{in,out} and
1689  * data copying algorithm and the default limits.
1987 !  Now long word aligned and have at least 32 bytes to move
2045 !  Now word aligned and have at least 36 bytes to move
2098 !  Now half word aligned and have at least 38 bytes to move
2152  * profiling and dtrace of the portions of the copy code that uses
2773 !  Now long word aligned and have at least 32 bytes to move
2828 !  Now word aligned and have at least 36 bytes to move
2880 !  Now half word aligned and have at least 38 bytes to move
2931  * profiling and dtrace of the portions of the copy code that uses
3527  * and returns 1.  Otherwise 0 is returned indicating success.
3528  * Caller is responsible for ensuring use_hw_bzero is true and that
3554 	! ... and must be 256 bytes or more
3559 	! ... and length must be a multiple of VIS_BLOCKSIZE
3579 	and	%l1, -VIS_BLOCKSIZE, %l1