Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.




			AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6, and the k62mmx
subdirectory has MMX code suiting K6-2 and K6-3.  All chips in the K6 family
have MMX; the separate directories exist only so that ./configure can omit
them if the assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4  (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

        mpn_l/rshift               3.0

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_copyi                  1.0
        mpn_copyd                  1.0

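As a rough illustration of how the figures above turn into loop times, the
following C sketch multiplies a per-limb or per-crossproduct cost by an
operation count.  The function names are hypothetical and not part of GMP.
It assumes an N x N basecase multiply or square is counted as N*N
crossproducts, and it ignores call overhead and cache misses.

```c
#include <stddef.h>

/* Approximate K6 costs, copied from the table above. */
#define K6_CYCLES_PER_LIMB_MUL_1       6.25
#define K6_CYCLES_PER_CROSS_MUL_BASE   9.25
#define K6_CYCLES_PER_CROSS_SQR_BASE   4.7

/* Hypothetical helper: estimated loop cycles for mpn_mul_1 on n limbs. */
double est_mul_1_cycles(size_t n)
{
    return K6_CYCLES_PER_LIMB_MUL_1 * (double) n;
}

/* Assumption: an n x n basecase multiply performs n*n crossproducts. */
double est_mul_basecase_cycles(size_t n)
{
    return K6_CYCLES_PER_CROSS_MUL_BASE * (double) n * (double) n;
}

/* Assumption: squaring is charged per crossproduct as if it were n x n. */
double est_sqr_basecase_cycles(size_t n)
{
    return K6_CYCLES_PER_CROSS_SQR_BASE * (double) n * (double) n;
}
```

On these assumptions a 10 limb operand comes out at about 925 cycles for
mpn_mul_basecase and about 470 for mpn_sqr_basecase.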

K6-2 and K6-3 have dual-issue MMX and get the following improvements.

        mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.
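
In C, the nearest analogue to a computed jump into an unrolled loop is a
switch that enters the loop body part way through (often called Duff's
device).  The following is only a sketch of the idea, not the assembler used
in this directory; limb_t, the function name and the unroll factor of 4 are
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Copy n limbs from src to dst using a 4-way unrolled loop.  The switch
   jumps into the middle of the first pass so that the n % 4 leftover
   limbs are handled without a separate cleanup loop. */
void copy_limbs_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;   /* number of times the loop body is entered */

    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

On the figures above, the jump and its setup cost roughly 12 cycles, so an
entry like this only pays off when the unrolled body saves more than that
over the whole operand compared with a simple loop.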

Position independent code is implemented using a call to get eip for
computed jumps.  A ret is always done to return from that call, rather than
an addl $4,%esp or a popl, so the CPU return address branch prediction stack
stays synchronised with the actual stack in memory.  Such a call however
still costs 4 to 7 cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.
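
The same layout preference can be expressed in C as a hint to the compiler.
The sketch below uses __builtin_expect, a GCC and Clang builtin, to mark the
n == 0 case as unlikely; those compilers typically keep the likely path as
straight-line code, but the resulting branch layout is not guaranteed.  The
UNLIKELY macro, limb_t and sum_limbs are hypothetical names and are not
taken from the code in this directory.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#define UNLIKELY(cond) (cond)   /* no hint available, plain condition */
#endif

typedef uint32_t limb_t;   /* assumption: 32-bit limbs */

/* Sum the n limbs at p, wrapping modulo 2^32.  The rare n == 0 case is
   marked unlikely so the common path can fall straight into the loop. */
limb_t sum_limbs(const limb_t *p, size_t n)
{
    limb_t sum = 0;

    if (UNLIKELY(n == 0))
        return 0;

    do {
        sum += *p++;
    } while (--n != 0);   /* backward branch, taken on all but the last pass */

    return sum;
}
```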



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.
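
The pairing conditions above can be written down as a small predicate.  This
is a simplified model for illustration only: struct insn and can_pair are
hypothetical, and dependencies are represented as register bitmasks (a bit
could equally stand for the flags).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one instruction. */
struct insn {
    bool     short_decode;   /* handled by the short decoder */
    bool     uses_shifter;
    bool     is_load;
    bool     is_store;
    uint32_t reads;          /* bitmask of registers read */
    uint32_t writes;         /* bitmask of registers written */
};

/* True if "second" may be shown paired with "first" under the rules
   listed above. */
bool can_pair(const struct insn *first, const struct insn *second)
{
    if (!first->short_decode || !second->short_decode)
        return false;                        /* both must be short decode */
    if (second->reads & first->writes)
        return false;                        /* second depends on first */
    if (second->uses_shifter)
        return false;                        /* only the first may shift */
    if (first->is_load && second->is_load)
        return false;                        /* no more than one load */
    if (first->is_store && second->is_store)
        return false;                        /* no more than one store */
    return true;
}
```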

K6 does some out of order execution, so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.
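
The two alignment checks above amount to simple address arithmetic.  The
following C sketch is not the cross.pl script, just an illustration of the
same tests; the function names are hypothetical and a 32 byte cache line is
assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* assumption: 32 byte cache lines */

/* True if the nbytes bytes starting at addr straddle a cache line
   boundary.  Use nbytes = 2 for opcode/modrm and nbytes = 3 for
   0Fh/opcode/modrm.  nbytes must be at least 1. */
bool crosses_cache_line(uint32_t addr, uint32_t nbytes)
{
    uint32_t first_line = addr / CACHE_LINE_BYTES;
    uint32_t last_line  = (addr + nbytes - 1) / CACHE_LINE_BYTES;

    return first_line != last_line;
}

/* True if a loop top or branch target at addr is 16 byte aligned. */
bool is_aligned_16(uint32_t addr)
{
    return (addr % 16u) == 0;
}
```

For example, a 0Fh/opcode/modrm sequence starting at offset 30 occupies
bytes 30 to 32 and so crosses into the next line, whereas the same sequence
starting at offset 32 does not.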

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem and can be used as an equivalent, or it's easier just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.
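
The mod/rm=00-xxx-100 pattern described above can be tested with a mask.
This C sketch is only an illustration of the bit pattern and is not how
cross.pl is written; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* The mod/rm byte is laid out as mod (bits 7-6), reg (bits 5-3) and
   r/m (bits 2-0).  Return true for mod=00 with r/m=100, the form where a
   sib byte follows and must be examined to see whether there is a
   displacement.  The reg field (xxx) is ignored. */
bool modrm_is_mod00_sib(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x04u;
}
```

A byte such as 0x04 or 0x3C matches the pattern.  With an explicit zero
byte displacement the mod field becomes 01, giving for instance 0x44, which
no longer matches.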

Calls

- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl %eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back




REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End:
