Name               Date          Size

add_n.asm          22-Nov-2012   4.4 KiB
addmul_1.asm       22-Nov-2012   2.4 KiB
alpha-defs.m4      22-Nov-2012   2.8 KiB
aorslsh1_n.asm     22-Nov-2012   4.6 KiB
bdiv_dbm1c.asm     22-Nov-2012   5.3 KiB
cntlz.asm          22-Nov-2012   1.3 KiB
com.asm            22-Nov-2012   3.9 KiB
copyd.asm          22-Nov-2012   2 KiB
copyi.asm          22-Nov-2012   1.9 KiB
default.m4         22-Nov-2012   2.6 KiB
dive_1.c           22-Nov-2012   2.6 KiB
divrem_2.asm       22-Nov-2012   4.1 KiB
ev5/               13-May-2013   4
ev6/               13-May-2013   10
ev67/              22-Nov-2012   5
gmp-mparam.h       22-Nov-2012   2.9 KiB
invert_limb.asm    22-Nov-2012   17.8 KiB
lshift.asm         22-Nov-2012   3.2 KiB
mod_34lsub1.asm    22-Nov-2012   3.2 KiB
mode1o.asm         22-Nov-2012   5.1 KiB
mul_1.asm          22-Nov-2012   2.6 KiB
README             22-Nov-2012   7.5 KiB
rshift.asm         22-Nov-2012   3 KiB
sqr_diagonal.asm   22-Nov-2012   1.7 KiB
sub_n.asm          22-Nov-2012   4.7 KiB
submul_1.asm       22-Nov-2012   2.4 KiB
umul.asm           22-Nov-2012   1.1 KiB
unicos.m4          22-Nov-2012   2.8 KiB

README

Copyright 1996, 1997, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License along
with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need special
handling by the unwinder.  It also says whether $27 is really needed for
computing the gp.  The `.mask M' pseudo op says which registers are saved on
the stack, and at what offset in the frame.

Cray T3 code is very very different...

"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
/ "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
them to "$6" or "$f6" where necessary.

"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
required.  The X() macro accommodates this difference.

"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
necessary.

"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
the Unicos assembler.  The full "ornot" must be used.

"unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
r31,0(r30)", and in fact use that define on all systems since it comes out the
same.

"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
available in older alpha assemblers (including gas prior to 2.12), according to
the GCC manual, so the assembler macro forms must be used (e.g. ldgp).



RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
   write-through, and transferring a cache line from the store buffer to the
   off-chip L2 takes as many as 15 cycles on most systems.  This delay hurts
   mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   of one every 21 cycles.
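
The issue-rate arithmetic in item 3 can be written out as a small C sketch.
The function names are hypothetical, and the per-limb bound assumes that the
multiplier is the only bottleneck and that the loop needs a fixed number of
multiplies per limb.

```c
/* Cycles between successive multiply issues when only part of the
   multiply latency is pipelined. */
unsigned mul_issue_interval(unsigned latency, unsigned pipelined_cycles)
{
    return latency - pipelined_cycles;
}

/* Lower bound, in cycles per limb, for a loop that needs muls_per_limb
   multiplies per limb.  Assumption: nothing else limits the loop. */
unsigned mul_bound_per_limb(unsigned latency, unsigned pipelined_cycles,
                            unsigned muls_per_limb)
{
    return muls_per_limb * mul_issue_interval(latency, pipelined_cycles);
}
```

With the EV4 figures, mul_issue_interval(23, 2) is 21.  For a loop that needs
both a mulq and a umulh per limb (an assumption about the loop, not something
stated above), mul_bound_per_limb(23, 2, 2) is 42.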

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
   are memory operations.  This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

               ldq  ldq
                 \  /\
          (or)   addq |
           |\   /   \ |
           | addq  cmpult
            \  |     |
             cmpult  |
                 \  /
                  or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   might waste a cycle on EV4.  The total depth remains unaffected, since cmov
   has a latency of 2 cycles.

     addq
     /   \
   addq  cmpult
     |      \
   cmpult -> cmovne

   Montgomery has a slightly different way of computing carry that requires one
   fewer instruction, but has depth 4 (instead of the current 3).  Since the
   code is currently instruction issue bound, Montgomery's idea should save us
   1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
   cycles/limb.  Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the
   single-limb operand into 21-bit chunks and the multi-limb operand into
   32-bit chunks, and then using FP operations for every other multiply and
   integer operations for the remaining ones.

   But it seems much better to split the single-limb operand into 16-bit
   chunks, since we save many integer shifts and adds that way.  See
   powerpc64/README for some more details.
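
The carry recurrence drawn for mpn_add_n in item 3 can be sketched in C as
follows.  This is only an illustration of the dependency graph, not the
assembly in add_n.asm; the name add_n_sketch and the limb_t typedef are
hypothetical stand-ins, and a 64-bit limb is assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit limb type */

/* Add {up,n} and {vp,n}, store the sum at {rp,n}, return carry-out (0 or 1).
   The comments name the Alpha instruction each line corresponds to in the
   dependency graph. */
limb_t add_n_sketch(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t cy = 0;              /* carry-in to the current limb */
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];       /* ldq */
        limb_t v = vp[i];       /* ldq */
        limb_t s = u + v;       /* addq:   does not depend on the carry */
        limb_t c1 = s < v;      /* cmpult: carry from u + v, off the chain */
        limb_t r = s + cy;      /* addq:   1st operation on the carry chain */
        limb_t c2 = r < cy;     /* cmpult: 2nd operation on the carry chain */
        cy = c1 | c2;           /* or:     3rd operation, gives carry-out */
        rp[i] = r;              /* stq */
    }
    return cy;
}
```

Only the last addq, the last cmpult and the or sit between carry-in and
carry-out, which is where the figure of 3 operations per limb comes from.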

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One integer
multiply instruction can issue each cycle.  To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two dependencies to watch out for.  1) Address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable.  Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined by the number of latency cycles it takes from accepting
carry-in to a vector point until we can generate carry-out.
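
The point about address arithmetic can be sketched in C with a 4-fold unrolled
copy loop.  This is a minimal illustration, not the code in copyi.asm; the name
copyi_unrolled_sketch and the limb_t typedef are hypothetical, and a compiler
is free to schedule the C differently from what the comments suggest.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit limb type */

/* Copy {up,n} to {rp,n}, lowest limb first. */
void copyi_unrolled_sketch(limb_t *rp, const limb_t *up, size_t n)
{
    /* Main loop: the four loads and four stores use constant offsets 0..3
       from up and rp, so they depend on one pointer update per iteration
       instead of on a chain of four index increments. */
    while (n >= 4) {
        limb_t a = up[0], b = up[1], c = up[2], d = up[3];
        rp[0] = a;
        rp[1] = b;
        rp[2] = c;
        rp[3] = d;
        up += 4;
        rp += 4;
        n -= 4;
    }
    /* Tail: handle the remaining 0 to 3 limbs one at a time. */
    while (n > 0) {
        *rp++ = *up++;
        n--;
    }
}
```

The same structure applies to loops that carry a value between limbs, but there
the carry chain, not the addressing, sets the limit.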

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that a CMOV
should always be placed as the last instruction of an aligned 4-instruction
block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.



REFERENCES

"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
EC-QD2KC-TE.

"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
order number EC-QP99C-TE.

"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
Compaq, September 2000, order number DS-0028B-TE.

"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
EC-RJ66A-TE.

All of the above are available online from

  http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  ftp://ftp.compaq.com/pub/products/alphaCPUdocs

"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
number AA-PS31D-TE.

"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
March 1996, part number AA-PY8AC-TE.

The above are available online at

  http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
online documentation" from the main www.hp.com page.)



----------------
Local variables:
mode: text
fill-column: 79
End: