Copyright 1996, 1999, 2000, 2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

    mpn_add_n/sub_n            2.375

    mpn_mul_1                 12.0
    mpn_add/submul_1          14.0

    mpn_mul_basecase          14.2 cycles/crossproduct (approx)

    mpn_sqr_basecase           8 cycles/crossproduct (approx)
                               or 15.5 cycles/triangleproduct (approx)

    mpn_l/rshift               5.375 normal (6.0 on P54)
                               1.875 special shift by 1 bit

    mpn_divrem_1              44.0
    mpn_mod_1                 28.0
    mpn_divexact_by3          15.0

    mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

    mpn_l/rshift               1.75

    mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run as 9.
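
For orientation, the arithmetic that mpn_addmul_1 performs can be sketched
in portable C.  This is only an illustration, not the assembly code in this
directory: the name ref_addmul_1 and the limb_t typedef are made up here,
and 32-bit limbs are assumed, as on P5.  Each iteration needs one 32x32->64
multiply, which is the mul instruction whose cost is discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, as used on P5.  limb_t is a name made up here. */
typedef uint32_t limb_t;

/* Hypothetical reference routine: {rp,n} += {up,n} * v.
   Returns the limb carried out of the most significant position. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;            /* low half stays in the result */
        carry = (limb_t)(t >> 32);    /* high half moves to the next limb */
    }
    return carry;
}
```

The one-limb case r = {1}, u = {2}, v = 3 leaves r = {7} and returns a carry
of 0.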



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost, the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

    shldl(  %cl, %eax, %ebx)
    xorl    %edx, %edx
    xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

    shldl(  %cl, %eax, %ebx)
    xorl    %edx, %edx
    xorl    %esi, %esi
    xorl    %edi, %edi
    xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

  Pentium doesn't allocate cache lines on writes, unlike most other modern
  processors.  Since the functions in the mpn class do array writes, we
  have to handle allocating the destination cache lines by reading a word
  from it in the loops, to achieve the best performance.
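
The read-before-write idea above can be sketched in C as follows.  This is
an illustration only, not the code used in this directory: the name
copy_touch_dst and the limb_t typedef are made up here, and the sketch
assumes 32-bit limbs, 32-byte cache lines and a destination that starts on
a line boundary.  The volatile read is there only so that a compiler keeps
the otherwise useless load; the assembly routines simply issue a load.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs.  limb_t is a name made up for this sketch. */
typedef uint32_t limb_t;

/* Hypothetical copy loop that reads one word of each destination cache
   line before storing to it, so the line is brought into L1 by the read.
   Assumes 32-byte lines and that dst is readable and line-aligned. */
void copy_touch_dst(limb_t *dst, const limb_t *src, size_t n)
{
    enum { LIMBS_PER_LINE = 32 / sizeof(limb_t) };
    volatile limb_t *touch = dst;   /* volatile: the dummy read must stay */
    size_t i = 0;

    while (i < n) {
        size_t end = (n - i > LIMBS_PER_LINE) ? i + LIMBS_PER_LINE : n;

        (void)touch[i];             /* dummy read allocates the line */

        for (; i < end; i++)
            dst[i] = src[i];        /* stores now hit the cache */
    }
}
```

If dst is not line-aligned the copy is still correct, but some stores reach
a line before it has been read, so only part of the benefit is obtained.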

Prefetching Sources

  Prefetching of sources is pointless since there are no out-of-order
  loads.  Any load instruction blocks until the line is brought to L1, so
  it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

  Pairing of memory operations requires that the two issued operations
  refer to different cache banks (i.e. different addresses modulo 32
  bytes).  The simplest way to ensure this is to read/write two words from
  the same object.  If we make operations on different objects, they might
  or might not be to the same cache bank.

PIC %eip Fetching

  A simple call $+5 and popl can be used to get %eip; there's no need to
  balance calls and returns since P5 doesn't have any return stack branch
  prediction.

Float Multiplies

  fmul is pairable and can be issued every 2 cycles (with a 4 cycle
  latency for data ready to use).  This is a lot better than integer mull
  or imull at 9 cycles non-pairing.  Unfortunately the advantage is
  quickly eaten away by needing to throw data through memory back to the
  integer registers to adjust for fild and fist being signed, and to do
  things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: