Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers and compilers need to put explicit breaks `;;' wherever there
are WAW or RAW dependencies, with some notable exceptions.  Such "breaks"
are typically at the end of a bundle, but can be put between operations
within some bundle types too.
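
The bundle layout can be illustrated with a small Python sketch.  It assumes
the encoding given in the Intel architecture manual (a 5-bit template in bits
0-4, followed by three 41-bit instruction slots).  The names split_bundle and
join_bundle are made up for this illustration; nothing like them exists in
GMP.

```python
TEMPLATE_BITS = 5   # assumed: template field occupies bits 0..4
SLOT_BITS = 41      # assumed: three 41-bit instruction slots follow


def split_bundle(bundle):
    """Split a 128-bit bundle (a Python int) into (template, [slot0, slot1, slot2])."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots


def join_bundle(template, slots):
    """Inverse of split_bundle: pack a template and three slots into one int."""
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle
```

The template field is what tells the hardware which unit types the three
slots go to and where a break falls, which is why only some bundle types can
carry a break between operations.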

The Itanium 1 and Itanium 2 implementations can, under ideal conditions,
execute two bundles (six instructions) per cycle.  The Itanium 1 allows 4 of
these instructions to be integer operations, while the Itanium 2 allows all
6 to be integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since many
issue slots are taken up by instructions with zero predicates, and since
many extra instructions are needed to set things up.  These features are
clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instruction.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instruction.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack
of alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generate the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ..."; the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working from 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8  [wrp]
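
What "xma accumulation" buys can be seen in a small Python model.  This is
only a sketch of the arithmetic, not the assembly loop: xma_l and xma_hu
stand for the low and high 64 bits of a*b+c on unsigned 64-bit operands, and
mul_1 and limbs_to_int are illustrative names, with limbs stored least
significant first.

```python
MASK = (1 << 64) - 1  # one 64-bit limb


def xma_l(a, b, c):
    """Low 64 bits of a*b + c (models xma.l)."""
    return (a * b + c) & MASK


def xma_hu(a, b, c):
    """High 64 bits of the unsigned a*b + c (models xma.hu)."""
    return (a * b + c) >> 64


def mul_1(up, v):
    """Multiply the limb vector up by the single limb v; return (rp, carry)."""
    rp = []
    carry = 0
    for u in up:
        # The previous high limb enters as the xma addend, so no separate
        # integer add or compare is needed for carry propagation.
        rp.append(xma_l(u, v, carry))
        carry = xma_hu(u, v, carry)
    return rp, carry


def limbs_to_int(limbs):
    """Helper for checking: combine limbs (least significant first)."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

Each limb depends on the xma.hu result of the previous one, so a single chain
is bound by the xma latency.  Running several independent chains at different
places in the operands, as suggested above, is what would hide that latency.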

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8  [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l (10
instructions at 6 per cycle), and with the 4 cycle latency of xma, this
means we need at least 3 blocks.  Using ldfp8 we could approach 1.583 c/l.
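
The figures above can be reproduced with a few lines of Python.  The
derivation is an assumption about how the numbers were obtained: the issue
width is taken as 6 instructions per cycle, ldfp8 is counted as half an
instruction per limb because it loads two limbs at once, and the block count
is the xma latency divided by the time one limb takes, rounded up.

```python
from math import ceil

ISSUE_WIDTH = 6    # instructions per cycle (two bundles)
XMA_LATENCY = 4    # cycles, as stated above


def cycles_per_limb(insns_per_limb):
    """Issue-bound lower limit on cycles per limb."""
    return insns_per_limb / ISSUE_WIDTH


plain = cycles_per_limb(10)            # ld8, ldf8, 2 xma, getf.sig, 4 int ops, st8
blocks = ceil(XMA_LATENCY / plain)     # independent chains needed to cover xma
paired = cycles_per_limb(10 - 0.5)     # assumed: one ldfp8 replaces two ldf8

print(f"plain: {plain:.3f} c/l, blocks: {blocks}, with ldfp8: {paired:.3f} c/l")
```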

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N:

For best speed, we need to give up using mpn_addmul_1 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use on the other codes is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.
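
The trade-off can be modelled in Python.  This is one plausible reading of
the two carry schemes, not a transcription of the assembly: add_n_serial
threads the carry through an add and a compare per limb, while
add_n_speculative computes both u+v and u+v+1 with their carry-outs up front
and lets the incoming carry merely select between them.  All function names
are illustrative, and limbs are stored least significant first.

```python
MASK = (1 << 64) - 1  # one 64-bit limb


def add_n_serial(up, vp):
    """Carry passes through an add and a compare on every limb."""
    rp, carry = [], 0
    for u, v in zip(up, vp):
        s = (u + v) & MASK           # add
        c = s < u                    # cmp: carry out of u + v
        t = (s + carry) & MASK       # add the incoming carry
        carry = int(c or t < s)      # carry out of either addition
        rp.append(t)
    return rp, carry


def add_n_speculative(up, vp):
    """Both outcomes are precomputed; the carry only selects one of them."""
    rp, carry = [], 0
    for u, v in zip(up, vp):
        s0 = (u + v) & MASK          # add:  sum if carry-in is 0
        s1 = (u + v + 1) & MASK      # add:  sum if carry-in is 1
        c0 = s0 < u                  # cmp:  carry-out if carry-in is 0
        c1 = s1 <= u                 # cmp:  carry-out if carry-in is 1
        rp.append(s1 if carry else s0)
        carry = int(c1 if carry else c0)
    return rp, carry


def limbs_to_int(limbs):
    """Helper for checking: combine limbs (least significant first)."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

In the speculative form nothing except the final selection depends on the
previous limb, which is how the recurrence shrinks to a single cycle at the
cost of four operations per limb.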

/* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
	mov		lc = un
Loop:	ldf8		u0 = [r33], 8
	ld8		r0 = [r32]
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7
	getf.sig	l0 = lp0
	getf.sig	l1 = lp1
	getf.sig	l2 = lp2
	getf.sig	l3 = lp3
	getf.sig	l4 = lp4
	getf.sig	l5 = lp5
	getf.sig	l6 = lp6
	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
	getf.sig	acc6 = lp7
	st8		[r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle:	8 cycles/limb8
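
The resource bounds in the table are simple ratios, and the largest of them
limits the loop.  A short Python check, using only the counts and per-cycle
limits listed above (the dictionary names are illustrative):

```python
# Operation counts for one iteration of the loop sketched above.
counts = {"insn": 49, "memops": 11, "fpops": 16, "intops": 21}

# Per-cycle limits taken from the table above.
limits = {"insn": 6, "memops": 2, "fpops": 2, "intops": 4}

# Each resource gives a lower bound on cycles per iteration.
bounds = {name: counts[name] / limits[name] for name in counts}

# Memory and integer operations together are limited to 4 per cycle.
bounds["memops+intops"] = (counts["memops"] + counts["intops"]) / 4

# The tightest (largest) bound is the one that actually limits the loop.
binding = max(bounds, key=bounds.get)

for name, value in bounds.items():
    print(f"{name:14s} {value:.3f}")
print("binding resource:", binding)
```

With these counts the overall issue width is the binding resource, narrowly
ahead of the floating-point and combined memory+integer limits.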

================================================================
mpn_lshift, mpn_rshift:

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
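
How shrp would be used for a left shift can be sketched in Python.  The model
assumes shrp returns the low 64 bits of the 128-bit pair hi:lo shifted right
by count; the function lshift and its limb order (least significant first)
are illustrative and ignore the operand overlap rules of the real
mpn_lshift.

```python
MASK = (1 << 64) - 1  # one 64-bit limb


def shrp(hi, lo, count):
    """Low 64 bits of the 128-bit value hi:lo shifted right by count."""
    return (((hi << 64) | lo) >> count) & MASK


def lshift(up, cnt):
    """Shift the limb vector up left by cnt bits; return (rp, bits shifted out)."""
    assert 1 <= cnt <= 63          # one loop per count, since shrp takes an immediate
    rp = []
    prev = 0                       # limb below the current one
    for u in up:
        # A left shift by cnt is a right shift of the pair by 64 - cnt.
        rp.append(shrp(u, prev, 64 - cnt))
        prev = u
    out = prev >> (64 - cnt) if up else 0
    return rp, out


def limbs_to_int(limbs):
    """Helper for checking: combine limbs (least significant first)."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

One shrp produces a complete result limb, where the shl/shr form needs two
shifts and an or.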

================================================================
mpn_copyi, mpn_copyd:

The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel documents 245317-004, 245318-004 and 245319-004, October 2002.  Volume
1 includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm
