Copyright 1996, 1999, 2000, 2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

	mpn_add_n/sub_n            2.375

	mpn_mul_1                 12.0
	mpn_add/submul_1          14.0

	mpn_mul_basecase          14.2 cycles/crossproduct (approx)

	mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

	mpn_l/rshift               5.375 normal (6.0 on P54)
				   1.875 special shift by 1 bit

	mpn_divrem_1              44.0
	mpn_mod_1                 28.0
	mpn_divexact_by3          15.0

	mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

	mpn_l/rshift               1.75

	mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at 2 cycles/limb asymptotically.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures as 9, and the routines using it run as if it were 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, which currently means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.
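
The 43/8 figure can be read as a loop doing 8 limbs per pass at 5 cycles
each, plus 3 cycles of loop overhead.  That split is an assumption made here
for illustration (only the 43/8 total and the 5 c/l asymptote are given
above), and cycles_per_limb is a hypothetical helper, not part of GMP.  A
small C sketch of the arithmetic:

```c
/* Hypothetical helper, for illustration only: average cycles per limb of
   an unrolled loop, assuming each pass handles limbs_per_pass limbs at
   core_cycles each and pays overhead_cycles once per pass.  */
double
cycles_per_limb (int limbs_per_pass, int core_cycles, int overhead_cycles)
{
  int cycles_per_pass = limbs_per_pass * core_cycles + overhead_cycles;
  return (double) cycles_per_pass / limbs_per_pass;
}
```

With cycles_per_limb (8, 5, 3) this gives 43/8 = 5.375.  The same model with
2 cycles per limb gives 19/8 = 2.375, which matches the mpn_add_n figure in
the STATUS table, though whether that loop is actually structured this way
is not stated here.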

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back-to-back repetitions of the following

	shldl(	%cl, %eax, %ebx)
	xorl	%edx, %edx
	xorl	%esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

	shldl(	%cl, %eax, %ebx)
	xorl	%edx, %edx
	xorl	%esi, %esi
	xorl	%edi, %edi
	xorl	%ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to allocate the destination cache lines ourselves, by reading a
    word from each in the loops, to achieve the best performance.
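
    As a rough illustration of the idea in C (the real routines are
    assembler and interleave the read with other work), the sketch below
    reads one word from each destination cache line before the first store
    to that line.  The name copy_with_dst_touch and the LINE_BYTES constant
    are hypothetical; the 32-byte figure is the Pentium L1 line size.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32   /* assumed Pentium L1 data cache line size */

/* Hypothetical example, not a GMP function: copy n words from src to dst,
   touching each destination cache line with a read before writing it.  */
void
copy_with_dst_touch (uint32_t *dst, const uint32_t *src, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      /* At the first word, and whenever a new line begins, do a dummy
         read.  The volatile cast keeps the compiler from dropping the
         load; the value itself is ignored.  */
      if (i == 0 || ((uintptr_t) &dst[i] % LINE_BYTES) == 0)
        (void) *(volatile uint32_t *) &dst[i];

      dst[i] = src[i];
    }
}
```

    On a processor that allocates lines on write the dummy read is just
    overhead, so this only makes sense for P5-style caches.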

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.
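
    A small C sketch of the bank test follows.  It assumes eight banks of
    4 bytes each, interleaved across each 32-byte span, so that the bank is
    given by address bits 2 to 4; the note above only says "modulo 32
    bytes", so treat that layout as an assumption.  cache_bank and
    banks_clash are hypothetical names.

```c
#include <stdint.h>

/* Assumed layout: 8 banks, each 4 bytes wide, so the bank number is
   address bits 2..4.  Hypothetical helper, for illustration only.  */
unsigned
cache_bank (const void *p)
{
  return (unsigned) (((uintptr_t) p >> 2) & 7u);
}

/* Non-zero if the two addresses fall in the same bank, in which case
   (under the assumption above) the two accesses would not pair.  */
int
banks_clash (const void *a, const void *b)
{
  return cache_bank (a) == cache_bank (b);
}
```

    Under this model two adjacent 4-byte words of one object never clash,
    while words 32 bytes apart always do, and words from two unrelated
    objects clash or not depending on where the objects happen to lie.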

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip.  There's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: