Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons
have MMX; the separate directory exists just so that configure can omit it
if the assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx),
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The write-allocate L1 data cache means prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles, which suggests 3 cycles/limb
is a limit on the speed of the multiplication routines.  The documentation
shows mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it
might be that, to get near 3 cycles, code has to be arranged so that
nothing else is issued to IEU0.  A busy IEU0 could explain why some code
takes 4 cycles and other apparently equivalent code takes 5.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple
of the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.

Position independent code is implemented using a call to get %eip for the
computed jumps, and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in the absence of any history, will guess that forward
jumps are not taken and backward jumps are taken.  Where possible it's
arranged that the less likely or less important case is under a taken
forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  The K7 always decodes three instructions and has
out-of-order execution, but the groupings show what slots might be
available and what dependency chains exist.

When there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are
dependencies, since this maximizes decoding throughput and might save a
cycle or two if decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision
B, August 1999.  This has some notes on general Athlon optimizations as
well as 3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: