Lines Matching refs:and

5  * Common Development and Distribution License, Version 1.0 only
12 * and limitations under the License.
15 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
52 * ! of copy and flags. Set up error handling accordingly.
53 * ! The transition point depends on whether the src and
58 * ! For FP version, %l6 holds previous error handling and
61 * ! So either %l6 or %o4 is reserved and not available for
110 * restore error handler and exit.
119 * restore error handler and exit.
123 * ! method: line up src and dst as best possible, then
210 * We've tried to restore fp state from the stack and failed. To
220 * avoiding a register save and restore. Also, less elaborate setup
222 * For longer copies, especially unaligned ones (where the src and
229 * moved whether the FP registers need to be saved, and some other
231 * 400 clocks. Since each non-repeated/predicted tst and branch costs
233 * longer copies and only benefit a small portion of medium sized
248 * is more data and that data is not in cache, failing to prefetch
251 * The exact tradeoff is strongly load and application dependent, with
262 * hw_copy_limit_1 = src and dst are byte aligned but not halfword aligned
263 * hw_copy_limit_2 = src and dst are halfword aligned but not word aligned
264 * hw_copy_limit_4 = src and dst are word aligned but not longword aligned
265 * hw_copy_limit_8 = src and dst are longword aligned
267 * To say that src and dst are word aligned means that after
269 * both the src and dst will be on word boundaries so that
270 * word loads and stores may be used.
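A minimal C sketch of the classification described by the fragments above, assuming the hw_copy_limit_? tunables named there; the helper name, the uint_t types, and the header are assumptions, since only the tunable names appear in this listing:

    #include <sys/types.h>

    extern uint_t hw_copy_limit_1;  /* byte aligned but not halfword */
    extern uint_t hw_copy_limit_2;  /* halfword aligned but not word */
    extern uint_t hw_copy_limit_4;  /* word aligned but not longword */
    extern uint_t hw_copy_limit_8;  /* longword aligned */

    /*
     * "Word aligned" above means src and dst can be brought onto word
     * boundaries together, i.e. the low-order bits of the two addresses
     * agree; XOR of the addresses exposes exactly that.
     */
    static uint_t
    mutual_alignment_limit(const void *src, const void *dst)
    {
            uintptr_t diff = (uintptr_t)src ^ (uintptr_t)dst;

            if ((diff & 0x7) == 0)
                    return (hw_copy_limit_8);
            if ((diff & 0x3) == 0)
                    return (hw_copy_limit_4);
            if ((diff & 0x1) == 0)
                    return (hw_copy_limit_2);
            return (hw_copy_limit_1);
    }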
273 * on Cheetah+ (900MHz), Cheetah++ (1200MHz), and Jaguar (1050MHz):
282 * If hw_copy_limit_? is set to a value between 1 and VIS_COPY_THRESHOLD (256)
285 * It is provided to allow for disabling FPBLK copies and to allow
292 * saves an alignment test, memory reference, and enabling test
296 * non-predicted tst and branch costs around 10 clocks.
297 * If src and dst are randomly selected addresses,
302 * But, tests on running kernels show that src and dst to copy code
303 * are typically not on random alignments. Structure copies and
314 * We subdivide the non-FPBLK case further into CHKSIZE bytes and less
316 * align src and dst. We try to minimize special case tests in
321 * src and dst alignment and provide special cases for each of
323 * to decide between short and medium size was chosen to be 39
325 * shift and 4 times 8 bytes for the first long word unrolling.
334 * branch instruction on Cheetah, Jaguar, and Panther, the
340 * and nops which are not executed in the code. This
345 * instruction and the unrolled loops, then the alignment needs
350 * a non-predicted tst and branch takes 10 clocks, this savings
362 * three iterations later and shows a measured improvement
371 * Notes on preserving existing fp state and on membars.
375 * preserve - the rest of the kernel does not use fp and, anyway, fp
378 * - userland has fp state and is interrupted (device interrupt
379 * or trap) and within the interrupt/trap handling we use
384 * userland or in kernel copy) and the tl0 component of the handling
386 * - a user process with fp state incurs a copy-on-write fault and
390 * using our stack is ideal (and since fp copy cannot be leaf optimized
395 * nops (those semantics always apply) and #StoreLoad is implemented
418 * ourselves and it is our cpu which will take any trap.
428 * and reboot the system (or restart the service with Greenline/Contracts).
432 * the event and the trap PC may not be the PC of the faulting access.
438 * is no need to repeat this), and we must force delivery of deferred
444 * Since the copy operations may preserve and later restore floating
449 * To make sure that floating point state is always saved and restored
455 * use. Bit 2 (TRAMP_FLAG) indicates that the call was to bcopy, and a
495 * Entry points bcopy, copyin_noerr, and copyout_noerr use this flag.
496 * kcopy, copyout, xcopyout, copyin, and xcopyin do not set this flag.
504 * Testing with 1200 MHz Cheetah+ and Jaguar gives best results with
505 * two prefetches, one with a reach of 8*BLOCK_SIZE+8 and one with a
508 * for the improvement is that with Cheetah and Jaguar, some prefetches
522 * floating-point register save area and 2 64-bit temp locations.
556 * Copy functions use either quadrants 1 and 3 or 2 and 4.
558 * FZEROQ1Q3: Zero quadrants 1 and 3, ie %f0 - %f15 and %f32 - %f47
559 * FZEROQ2Q4: Zero quadrants 2 and 4, ie %f16 - %f31 and %f48 - %f63
601 * Macros to save and restore quadrants 1 and 3 or 2 and 4 to/from the stack.
602 * Used to save and restore in-use fp registers when we want to use FP
603 * and find fp already in use and copy size still large enough to justify
604 * the additional overhead of this save and restore.
618 * original data, and a membar #Sync after restore lets the block loads
622 * and before using the BLD_*_FROMSTACK macro.
628 and tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */ ;\
637 and tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */ ;\
646 and tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */ ;\
655 and tmp1, -VIS_BLOCKSIZE, tmp1 /* block align */ ;\
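Each of the four macros above rounds a scratch stack address down to a block boundary with "and tmp1, -VIS_BLOCKSIZE, tmp1"; a one-line C equivalent of that masking step (the VIS_BLOCKSIZE value and the helper name are assumptions):

    #include <stdint.h>

    #define VIS_BLOCKSIZE   64      /* assumed; must be a power of two */

    /* Round addr down to the previous VIS_BLOCKSIZE boundary. */
    static inline uintptr_t
    block_align_down(uintptr_t addr)
    {
            return (addr & -(uintptr_t)VIS_BLOCKSIZE);
    }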
663 * FP_NOMIGRATE and FP_ALLOWMIGRATE. Prevent migration (or, stronger,
665 * switch) before commencing a FP copy, and reallow it on completion or
674 * CPU we perform the copy on and so that we know which CPU failed
676 * This could be achieved through disabling preemption (and we have done it that
827 * Errno value is in %g1. bcopy_more uses fp quadrants 1 and 3.
835 and %l6, TRAMP_FLAG, %l0 ! copy trampoline flag to %l0
856 ! and bcopy. kcopy will *always* set a t_lofault handler
858 ! and *not* to invoke any existing error handler. As far as
925 * Assumes double word alignment and a count >= 256.
1138 ! Now long word aligned and have at least 32 bytes to move
1188 ! Now word aligned and have at least 36 bytes to move
1235 ! Now half word aligned and have at least 38 bytes to move
1265 * profiling and dtrace of the portions of the copy code that uses
1285 ! kcopy and bcopy use the same code path. If TRAMP_FLAG is set
1286 ! and the saved lofault was zero, we won't reset lofault on
1523 subcc %o0, %o1, %o3 ! difference of from and to address
1530 2: cmp %o2, %o3 ! cmp size and abs(from - to)
1533 cmp %o0, %o1 ! compare from and to addresses
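The three instructions above are the overlap test: if the distance between the source and destination is at least the copy length, a plain forward copy is safe. A hedged C restatement (the function name is illustrative, not from this file):

    #include <stddef.h>
    #include <stdint.h>

    /* Nonzero when [from, from+size) and [to, to+size) overlap. */
    static int
    regions_overlap(const void *from, const void *to, size_t size)
    {
            uintptr_t f = (uintptr_t)from;
            uintptr_t t = (uintptr_t)to;
            uintptr_t dist = (f > t) ? f - t : t - f;   /* abs(from - to) */

            return (dist < size);
    }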
1570 * has already disabled kernel preemption and has checked
1710 * Transfer data to and from user space -
1715 * Note that copyin(9F) and copyout(9F) are part of the
1720 * So there are two extremely similar routines - xcopyin() and xcopyout()
1726 * There are also stub routines for xcopyout_little and xcopyin_little,
1737 * The only difference between copy{in,out} and
1757 * data copying algorithm and the default limits.
2055 ! Now long word aligned and have at least 32 bytes to move
2113 ! Now word aligned and have at least 36 bytes to move
2166 ! Now half word aligned and have at least 38 bytes to move
2220 * profiling and dtrace of the portions of the copy code that uses
2850 ! Now long word aligned and have at least 32 bytes to move
2905 ! Now word aligned and have at least 36 bytes to move
2957 ! Now half word aligned and have at least 38 bytes to move
3008 * profiling and dtrace of the portions of the copy code that uses
3613 * and returns 1. Otherwise 0 is returned indicating success.
3614 * Caller is responsible for ensuring use_hw_bzero is true and that
3640 ! ... and must be 256 bytes or more
3645 ! ... and length must be a multiple of VIS_BLOCKSIZE
3665 and %l1, -VIS_BLOCKSIZE, %l1
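Read together, the last few fragments describe the qualification gate of the hardware block-clear path: the caller is responsible for use_hw_bzero, and the routine itself returns 1 (declining, so the caller falls back) unless the request is at least 256 bytes and a multiple of VIS_BLOCKSIZE. A hedged C restatement of that gate; the VIS_BLOCKSIZE value, the helper name, and the address-alignment check (suggested by the masking line above rather than spelled out in these fragments) are assumptions:

    #include <stddef.h>
    #include <stdint.h>

    #define VIS_BLOCKSIZE   64      /* assumed block size */

    /*
     * Return 1 if the block-clear fast path must be declined, 0 if the
     * request qualifies.  Checking use_hw_bzero is the caller's job per
     * the comment above, so it is not repeated here.
     */
    static int
    hwblkclr_would_decline(const void *addr, size_t len)
    {
            if (((uintptr_t)addr & (VIS_BLOCKSIZE - 1)) != 0)
                    return (1);     /* not block aligned (assumed) */
            if (len < 256)
                    return (1);     /* must be 256 bytes or more */
            if ((len & (VIS_BLOCKSIZE - 1)) != 0)
                    return (1);     /* length not a multiple of block */
            return (0);
    }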