Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.




                    INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions; the sse2
subdirectory has routines using SSE2 instructions.  All P4s have both; the
separate directories exist only so that configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

        mpn_add_n/sub_n            4 normal, 6 in-place

        mpn_mul_1                  4 normal, 6 in-place
        mpn_addmul_1               6
        mpn_submul_1               7

        mpn_mul_basecase           6 cycles/crossproduct (approx)

        mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

        mpn_l/rshift               1.75



The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.
Unfortunately, using movq and splitting/combining seems to require too many
extra instructions to help.  Perhaps future chip steppings will be better.



NOTES

The Pentium-4 "Netburst" pipeline provides quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alternative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bit
values within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry still has a 4 cycle latency.

incl and decl should be avoided; use add $1 and sub $1 instead.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
A pxor/por or similar combination at 2 cycles latency can be used instead.
The movq, however, executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining appears to do enough that
explicit destination prefetching is not required.

The xmm registers haven't found a use so far, but not much effort has been
expended.  A configure test for whether the operating system knows
fxsave/fxrstor will be needed if they're used.



REFERENCES

Intel Pentium-4 processor manuals,

        http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.
Available on-line:

        http://developer.intel.com/design/pentium4/manuals/248966.htm



----------------
Local variables:
mode: text
fill-column: 76
End: