Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.





                      INTEL P6 MPN SUBROUTINES



This directory contains code optimized for Intel P6 class CPUs, meaning
PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
have routines using MMX instructions.



STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these can perhaps be improved.

                                cycles/limb

        mpn_add_n/sub_n           3.7

        mpn_copyi                 0.75
        mpn_copyd                 1.75 (or 0.75 if no overlap)

        mpn_divrem_1             39.0
        mpn_mod_1                21.5
        mpn_divexact_by3          8.5

        mpn_mul_1                 5.5
        mpn_addmul/submul_1       6.35

        mpn_l/rshift              2.5

        mpn_mul_basecase          8.2 cycles/crossproduct (approx)
        mpn_sqr_basecase          4.0 cycles/crossproduct (approx)
                                  or 7.75 cycles/triangleproduct (approx)

Pentium II and III have MMX and get the following improvements.

        mpn_divrem_1             25.0 integer part, 17.5 fractional part

        mpn_l/rshift              1.75




NOTES

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending on how far speculative execution has gone.  The 9
cycle minimum penalty comes from the issue pipeline being 9 stages.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  The 0.75 cycles/limb would be 3
cycles per 16 byte block.




CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three instructions with no successive
dependencies, and with only the first being a multiple micro-op.

P6 has out-of-order execution, so the groupings are really only showing
dependent paths where some shuffling might allow some latencies to be
hidden.




REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

        http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: