Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.




                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, and the sse2
subdirectory has routines using SSE2 instructions.  All P4s have these; the
separate directories exist only so that configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

	mpn_add_n/sub_n            4 normal, 6 in-place

	mpn_mul_1                  4 normal, 6 in-place
	mpn_addmul_1               6
	mpn_submul_1               7

	mpn_mul_basecase           6 cycles/crossproduct (approx)

	mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

	mpn_l/rshift               1.75
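
As a rough illustration of how the two mpn_sqr_basecase figures relate, the
C sketch below counts limb products for an n-limb square.  It assumes, which
this file does not state, that a "crossproduct" is one of the n*n limb
products of a full multiply and that a "triangleproduct" is one of the
n*(n+1)/2 products u[i]*u[j] with i <= j.  The function names are
hypothetical and are not part of GMP.

```c
#include <stddef.h>

/* Assumed definition: every limb product of a full n x n multiply. */
size_t crossproducts(size_t n)
{
    return n * n;
}

/* Assumed definition: products u[i]*u[j] with i <= j only. */
size_t triangleproducts(size_t n)
{
    return n * (n + 1) / 2;
}

/* Cycle estimate from the 3.5 cycles/crossproduct figure. */
double sqr_cycles_by_cross(size_t n)
{
    return 3.5 * (double) crossproducts(n);
}

/* Cycle estimate from the 7.0 cycles/triangleproduct figure. */
double sqr_cycles_by_triangle(size_t n)
{
    return 7.0 * (double) triangleproducts(n);
}
```

For n = 10 this gives 100 crossproducts and 55 triangle products, so the two
estimates are 350 and 385 cycles.  Under the assumed definitions they differ
only by a term linear in n, which fits both figures being marked approximate.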



The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 "Netburst" pipeline provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring the use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32 bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.
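
The paddq/psrlq carry scheme can be sketched in portable C as below.  This
is only an illustration of the idea, not the assembly in this directory:
add_n_wide is a hypothetical name, limbs are assumed to be 32 bits, and a
uint64_t stands in for the 64-bit mmx register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GMP code: add two n-limb numbers of 32-bit
   limbs, carrying through the upper half of a 64-bit accumulator instead
   of through the carry flag.  Returns the carry out (0 or 1). */
uint32_t add_n_wide(uint32_t *rp, const uint32_t *up, const uint32_t *vp,
                    size_t n)
{
    uint64_t acc = 0;                /* plays the role of the mmx register */

    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t) up[i];     /* 64-bit add, as paddq would do */
        acc += (uint64_t) vp[i];
        rp[i] = (uint32_t) acc;      /* store the low 32 bits of the sum */
        acc >>= 32;                  /* as psrlq $32: carry moves to bit 0 */
    }
    return (uint32_t) acc;
}
```

With u = {0xFFFFFFFF, 0xFFFFFFFF} and v = {1, 0}, both result limbs are zero
and the returned carry is 1.  The loop has a dependent add-then-shift chain
per limb, which is the latency the paragraph above refers to.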

incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

xmm registers so far haven't found a use, but not much effort has been
expended.  A configure test for whether the operating system supports
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

	http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

	http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: