1Copyright 2000, 2001 Free Software Foundation, Inc. 2 3This file is part of the GNU MP Library. 4 5The GNU MP Library is free software; you can redistribute it and/or modify 6it under the terms of the GNU Lesser General Public License as published by 7the Free Software Foundation; either version 3 of the License, or (at your 8option) any later version. 9 10The GNU MP Library is distributed in the hope that it will be useful, but 11WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 12or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public 13License for more details. 14 15You should have received a copy of the GNU Lesser General Public License 16along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. 17 18 19 20 21 AMD K6 MPN SUBROUTINES 22 23 24 25This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and 26K6-3. 27 28The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory 29has MMX code suiting K6-2 and K6-3. All chips in the K6 family have MMX, 30the separate directories are just so that ./configure can omit them if the 31assembler doesn't support MMX. 32 33 34 35 36STATUS 37 38Times for the loops, with all code and data in L1 cache, are as follows. 39 40 cycles/limb 41 42 mpn_add_n/sub_n 3.25 normal, 2.75 in-place 43 44 mpn_mul_1 6.25 45 mpn_add/submul_1 7.65-8.4 (varying with data values) 46 47 mpn_mul_basecase 9.25 cycles/crossproduct (approx) 48 mpn_sqr_basecase 4.7 cycles/crossproduct (approx) 49 or 9.2 cycles/triangleproduct (approx) 50 51 mpn_l/rshift 3.0 52 53 mpn_divrem_1 20.0 54 mpn_mod_1 20.0 55 mpn_divexact_by3 11.0 56 57 mpn_copyi 1.0 58 mpn_copyd 1.0 59 60 61K6-2 and K6-3 have dual-issue MMX and get the following improvements. 62 63 mpn_l/rshift 1.75 64 65 66Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch" 67instruction, code seems to run slower, and with just "mov" loads it doesn't 68seem faster. Results so far are inconsistent. The K6 does a hardware 69prefetch of the second cache line in a sector, so the penalty for not 70prefetching in software is reduced. 71 72 73 74 75NOTES 76 77All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow. 78 79Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can 80execute them in both X and Y (and in both together). 81 82Branch misprediction penalty is 1 to 4 cycles (Optimization Manual 83chapter 6 table 12). 84 85Write-allocate L1 data cache means prefetching of destinations is unnecessary. 86Store queue is 7 entries of 64 bits each. 87 88Floating point multiplications can be done in parallel with integer 89multiplications, but there doesn't seem to be any way to make use of this. 90 91 92 93OPTIMIZATIONS 94 95Unrolled loops are used to reduce looping overhead. The unrolling is 96configurable up to 32 limbs/loop for most routines, up to 64 for some. 97 98Sometimes computed jumps into the unrolling are used to handle sizes not a 99multiple of the unrolling. An attractive feature of this is that times 100smoothly increase with operand size, but an indirect jump is about 6 cycles 101and the setups about another 6, so it depends on how much the unrolled code 102is faster than a simple loop as to whether a computed jump ought to be used. 103 104Position independent code is implemented using a call to get eip for 105computed jumps and a ret is always done, rather than an addl $4,%esp or a 106popl, so the CPU return address branch prediction stack stays synchronised 107with the actual stack in memory. Such a call however still costs 4 to 7 108cycles. 109 110Branch prediction, in absence of any history, will guess forward jumps are 111not taken and backward jumps are taken. Where possible it's arranged that 112the less likely or less important case is under a taken forward jump. 113 114 115 116MMX 117 118Putting emms or femms as late as possible in a routine seems to be fastest. 119Perhaps an emms or femms stalls until all outstanding MMX instructions have 120completed, so putting it later gives them a chance to complete on their own, 121in parallel with other operations (like register popping). 122 123The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3 124at the start of a routine, in case it's been preceded by x87 floating point 125operations. This isn't done because in gmp programs it's expected that x87 126floating point won't be much used and that chances are an mpn routine won't 127have been preceded by any x87 code. 128 129 130 131CODING 132 133Instructions in general code are shown paired if they can decode and execute 134together, meaning two short decode instructions with the second not 135depending on the first, only the first using the shifter, no more than one 136load, and no more than one store. 137 138K6 does some out of order execution so the pairings aren't essential, they 139just show what slots might be available. When decoding is the limiting 140factor things can be scheduled that might not execute until later. 141 142 143 144NOTES 145 146Code alignment 147 148- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary, 149 short decode is inhibited. The cross.pl script detects this. 150 151- loops and branch targets should be aligned to 16 bytes, or ensure at least 152 2 instructions before a 32 byte boundary. This makes use of the 16 byte 153 cache in the BTB. 154 155Addressing modes 156 157- (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this 158 problem, and can be used as an equivalent, or easier is just to use a 159 different register, like %ebx. 160 161- K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3 162 have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F). 163 164 If more than 3 bytes are needed to determine instruction length then 165 decoding degrades from direct to long, or from long to vector. This 166 happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since 167 with mod=00 the sib determines whether there's a displacement. 168 169 This affects all MMX and 3DNow instructions, and others with an 0F prefix, 170 like movzbl. The modes affected are anything with an index and no 171 displacement, or an index but no base, and this includes (%esp) which is 172 really (,%esp,1). 173 174 The cross.pl script detects problem cases. The workaround is to always 175 use a displacement, and to do this with Zdisp if it's zero so the 176 assembler doesn't discard it. 177 178 See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages 179 13-14 and 36-37. 180 181Calls 182 183- indirect jumps and calls are not branch predicted, they measure about 6 184 cycles. 185 186Various 187 188- adcl 2 cycles of decode, maybe 2 cycles executing in the X pipe 189- bsf 12-27 cycles 190- emms 5 cycles 191- femms 3 cycles 192- jecxz 2 cycles taken, 13 not taken (optimization manual says 7 not taken) 193- divl 20 cycles back-to-back 194- imull 2 decode, 3 execute 195- mull 2 decode, 3 execute (optimization manual decoding sample) 196- prefetch 2 cycles 197- rcll/rcrl implicit by one bit: 2 cycles 198 immediate or %cl count: 11 + 2 per bit for dword 199 13 + 4 per bit for byte 200- setCC 2 cycles 201- xchgl %eax,reg 1.5 cycles, back-to-back (strange) 202 reg,reg 2 cycles, back-to-back 203 204 205 206 207REFERENCES 208 209"AMD-K6 Processor Code Optimization Application Note", AMD publication 210number 21924, revision D amendment 0, January 2000. This describes K6-2 and 211K6-3. Available on-line, 212 213http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf 214 215"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD 216publication number 21828, revision A amendment 0, August 1997. This is an 217older edition of the above document, describing plain K6. Available 218on-line, 219 220http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf 221 222"3DNow Technology Manual", AMD publication number 21928G/0-March 2000. 223This describes the femms and prefetch instructions, but nothing else from 2243DNow has been used. Available on-line, 225 226http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf 227 228"3DNow Instruction Porting Guide", AMD publication number 22621, revision B, 229August 1999. This has some notes on general K6 optimizations as well as 2303DNow. Available on-line, 231 232http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf 233 234 235 236---------------- 237Local variables: 238mode: text 239fill-column: 76 240End: 241