Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                      INTEL P6 MPN SUBROUTINES



This directory contains code optimized for Intel P6 class CPUs, meaning
PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
have routines using MMX instructions.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these could probably be improved.

                               cycles/limb

	mpn_add_n/sub_n           3.7

	mpn_copyi                 0.75
	mpn_copyd                 1.75 (or 0.75 if no overlap)

	mpn_divrem_1             39.0
	mpn_mod_1                21.5
	mpn_divexact_by3          8.5

	mpn_mul_1                 5.5
	mpn_addmul/submul_1       6.35

	mpn_l/rshift              2.5

	mpn_mul_basecase          8.2 cycles/crossproduct (approx)
	mpn_sqr_basecase          4.0 cycles/crossproduct (approx)
	                          or 7.75 cycles/triangleproduct (approx)

Pentium II and III have MMX and get the following improvements.

	mpn_divrem_1             25.0 integer part, 17.5 fractional part

	mpn_l/rshift              1.75




NOTES

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending on how far speculative execution has gone.  The 9
cycle minimum penalty comes from the issue pipeline being 9 stages.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  At 4 bytes per limb, the 0.75
cycles/limb would be 3 cycles per 16 byte block.




CODING

In general code, instructions are shown grouped when they can execute
together, which means up to three instructions with no successive
dependencies, and with only the first being a multiple micro-op
instruction.

P6 has out-of-order execution, so the groupings really only show dependent
paths where some shuffling might allow some latencies to be hidden.




REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

	http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

	http://download.intel.com/design/PentiumII/manuals/242816.htm


----------------
Local variables:
mode: text
fill-column: 76
End:
