Lines Matching refs:precision

129 great deal of work in which multiple precision mathematics was needed. Understanding the possibilities for speeding up 
130 multiple precision calculations is often very important since we deal with outdated machine architecture where modular
170 raise or lower the precision of the numbers we are dealing with. For example, in decimal we can almost immediately
171 reason that $7$ times $6$ is $42$. However, $42$ has two digits of precision as opposed to the one digit we started with.
172 A further multiplication by, say, $3$ results in the larger precision result $126$. In these few examples we have multiple
173 precisions for the numbers we are working with. Despite the various levels of precision a single subset\footnote{With the occasional optimization.}
176 By way of comparison a fixed or single precision operation would lose precision on various operations. For example, in
177 the decimal system with fixed precision $6 \cdot 7 = 2$.
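The fixed precision truncation described in the example above can be sketched in C (an illustrative toy, not code from the text; the function names are mine):

```c
#include <assert.h>

/* Toy one-digit, base-10 fixed precision multiply: the product is
 * truncated to a single digit, so 6 * 7 = 42 loses its high digit
 * and becomes 2. */
static int fixed_mul(int a, int b)
{
    return (a * b) % 10;   /* keep only the least significant digit */
}

/* A multiple precision system keeps every digit of the product. */
static int full_mul(int a, int b)
{
    return a * b;
}
```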
179 Essentially at the heart of computer based multiple precision arithmetic are the same long-hand algorithms taught in
183 The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation
187 Java \cite{JAVA} only provide intrinsic support for integers which are relatively small and single precision.
205 language\footnote{As per the ISO C standard. However, each compiler vendor is allowed to augment the precision as they
209 rendering any protocol based on the algorithm insecure. Multiple precision algorithms solve this very problem by
210 extending the range of representable integers while using single precision data types.
212 Most advancements in fast multiple precision arithmetic stem from the need for faster and more efficient cryptographic
218 However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines.
219 Another auxiliary use of multiple precision integers is high precision floating point data types.
222 floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small
223 (\textit{23, 48 and 64 bits}). The mantissa is merely an integer and a multiple precision integer could be used to create
224 a mantissa of much larger precision than hardware alone can efficiently support. This approach could be useful where
231 \index{precision}
232 The benefit of multiple precision representations over single or fixed precision representations is that
233 no precision is lost while representing the result of an operation which requires excess precision. For example,
234 the product of two $n$-bit integers requires at least $2n$ bits of precision to be represented faithfully. A multiple
235 precision algorithm would augment the precision of the destination to accommodate the result while a single precision system
236 would truncate excess bits to maintain a fixed level of precision.
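The claim that a product of two $n$-bit integers needs up to $2n$ bits can be checked numerically with a small helper (hypothetical, not from the text):

```c
#include <stdint.h>

/* Number of bits needed to represent x: position of the highest set
 * bit plus one, with bit_count(0) == 0. */
static int bit_count(uint64_t x)
{
    int k = 0;
    while (x != 0) {
        x >>= 1;
        ++k;
    }
    return k;
}
```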
238 It is possible to implement algorithms which require large integers with fixed precision algorithms. For example, elliptic
239 curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum
245 Multiple precision algorithms have the most overhead of any style of arithmetic. For the most part the
247 platforms. However, multiple precision algorithms do offer the most flexibility in terms of the magnitude of the
248 inputs. That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input
253 The purpose of this text is to instruct the reader on how to implement efficient multiple precision algorithms.
261 algorithm for performing multiple precision integer addition. However, the description lacks any discussion concerning
270 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer
280 This text shall also serve as a walkthrough of the creation of multiple precision algorithms from scratch. Showing
285 A multiple precision integer of $n$-digits shall be denoted as $x = (x_{n-1}, \ldots, x_1, x_0)_{ \beta }$ and represent
293 \ref{sec:MPINT}. For the purposes of this text a ``multiple precision integer'' and an ``mp\_int'' are assumed to be
302 a plain integer with no additional multiple-precision members. That is, algorithms that use integers as opposed to
305 precision algorithm to solve the same problem.
308 The variable $\beta$ represents the radix of a single digit of a multiple precision integer and
309 must be of the form $q^p$ for $q, p \in \Z^+$. A single precision variable must be able to represent integers in
310 the range $0 \le x < q \beta$ while a double precision variable must be able to represent integers in the range
316 a single precision integer type, while the data type \textbf{mp\_word} will represent a double precision integer type. In
317 several algorithms (notably the Comba routines) temporary results will be stored in arrays of double precision mp\_words.
318 For the purposes of this text $x_j$ will refer to the $j$'th digit of a single precision array and $\hat x_j$ will refer to
319 the $j$'th digit of a double precision array. Whenever an expression is to be assigned to a double precision
320 variable it is assumed that all single precision variables are promoted to double precision during the evaluation.
321 Expressions that are assigned to a single precision variable are truncated to fit within the precision of a single
322 precision data type.
324 For example, if $\beta = 10^2$ a single precision data type may represent a value in the
325 range $0 \le x < 10^3$, while a double precision data type may represent a value in the range $0 \le x < 10^5$. Let
326 $a = 23$ and $b = 49$ represent two single precision variables. The single precision product shall be written
327 as $c \leftarrow a \cdot b$ while the double precision product shall be written as $\hat c \leftarrow a \cdot b$.
329 in a single precision data type and as a result $c \ne \hat c$.
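The $\beta = 10^2$ example can be played out in C. The range limits below come from the surrounding text ($q\beta = 10^3$ and $q\beta^2 = 10^5$); the function names are mine:

```c
/* Single precision product c <- a * b, truncated to fit the single
 * precision range [0, 10^3) of the beta = 10^2 example. */
static int single_mul(int a, int b)
{
    return (a * b) % 1000;
}

/* Double precision product c^ <- a * b, kept in full since the double
 * precision range [0, 10^5) is large enough to hold it. */
static long double_mul(int a, int b)
{
    return (long)a * (long)b;
}
```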
332 Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision
344 The norm of a multiple precision integer, for example $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation
350 single precision operations are considered to have the same cost\footnote{Except where explicitly noted.}.
351 That is, a single precision addition, multiplication and division are assumed to take the same time to
430 LibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C. By portable it
460 library exists which can be used to teach computer science students how to perform fast and reliable multiple precision
466 \cite{OPENSSL} have multiple precision integer arithmetic routines but would not be ideal for this text for
527 inability to accommodate multiple precision integers is the problem. Furthermore, the solution must be written
562 Recall that most programming languages, in particular ISO C \cite{ISOC}, only have fixed precision data types that on their own cannot
563 be used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is
564 to use fixed precision data types to create and manipulate multiple precision integers which may represent values
571 multiple precision arithmetic is essentially the same concept. Larger integers are represented by adjoining fixed
572 precision computer words with the exception that a different radix is used.
574 What most people probably do not think about explicitly are the various other attributes that describe a multiple precision
583 will not exceed the allowed boundaries. These three properties make up what is known as a multiple precision
588 The mp\_int structure is the ISO C based manifestation of what represents a multiple precision integer. The ISO C standard does not provide for
621 array to accommodate the precision of the result.
624 precision integer. It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits. The array is maintained in a least
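A simplified sketch of such a structure, in the spirit of the mp\_int described here but not LibTomMath's actual definition:

```c
#include <stdlib.h>

typedef unsigned int mp_digit;  /* assumed single precision digit type */

/* Simplified multiple precision integer: 'used' digits are significant,
 * the remaining (alloc - used) digits are kept at zero, and the array
 * is stored least significant digit first. */
typedef struct {
    int       used;   /* digits actually in use         */
    int       alloc;  /* digits allocated               */
    int       sign;   /* 0 for positive, 1 for negative */
    mp_digit *dp;     /* the digit array                */
} toy_mp_int;

/* Initialize to zero with a given precision (returns 0 on success). */
static int toy_init(toy_mp_int *a, int prec)
{
    a->dp = calloc((size_t)prec, sizeof(mp_digit));
    if (a->dp == NULL)
        return -1;
    a->used  = 0;
    a->alloc = prec;
    a->sign  = 0;
    return 0;
}
```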
654 In LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int
720 The logical starting point when actually writing multiple precision integer functions is the initialization and
768 used to dictate the minimum precision of newly initialized mp\_int integers. Ideally, it is at least equal to the smallest
769 precision number you'll be working with.
875 able to augment the precision of an mp\_int and
883 result of an operation without loss of precision. Quite often the size of the array given by the \textbf{alloc} member
1058 will not contribute to the precision of the result. In fact, through subsequent operations more leading zero digits would
1059 accumulate to the point where the size of the integer would be prohibitive. As a result even though the precision is very
1166 If $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow
1190 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 30 to 33). In order to
1499 Comparing two multiple precision integers is performed with the same algorithm used to compare two decimal numbers. For example,
1637 In common twos complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus. For example, with 32-bit integers
1641 However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the
1649 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the
1807 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision
1809 loops. Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry.
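The long-hand addition loop can be sketched over base-10 digit arrays, least significant digit first (illustrative only, not s\_mp\_add itself):

```c
/* Long-hand unsigned addition of two n-digit numbers stored least
 * significant digit first, with beta = 10 for readability (any radix
 * works the same way).  The output array c must have room for n + 1
 * digits.  Returns the number of digits in the sum. */
static int digit_add(const int *a, const int *b, int n, int *c)
{
    int i, carry = 0;
    for (i = 0; i < n; ++i) {
        int t = a[i] + b[i] + carry;
        c[i]  = t % 10;          /* digit of the sum    */
        carry = t / 10;          /* propagate the carry */
    }
    c[n] = carry;                /* possible extra digit */
    return carry ? n + 1 : n;
}
```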
2011 It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$.
2021 operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two.
2063 are the same there is no need to perform two reads from the digits of $a$. Step 6.1 performs a single precision shift on the current digit $a_n$ to
2064 obtain what will be the carry for the next iteration. Step 6.2 calculates the $n$'th digit of the result as single precision shift of $a_n$ plus
2079 is the use of the logical shift operator on line 52 to perform a single precision doubling.
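The shift-with-carry doubling can be sketched with 8-bit digits (a width chosen purely for illustration; the text's mp\_mul\_2 works on mp\_digits):

```c
#include <stdint.h>

/* Doubling a multiple precision value via single precision shifts,
 * least significant digit first: the bit shifted out of each digit
 * becomes the carry into the next. */
static void digit_mul_2(const uint8_t *a, int n, uint8_t *b,
                        uint8_t *carry_out)
{
    uint8_t carry = 0;
    for (int i = 0; i < n; ++i) {
        uint8_t next = a[i] >> 7;            /* bit shifted out      */
        b[i]  = (uint8_t)((a[i] << 1) | carry);
        carry = next;
    }
    *carry_out = carry;                      /* new digit if nonzero */
}
```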
2244 it does not require single precision division. This algorithm does not actually return an error code as it cannot fail.
2377 mp\_mul\_2d by first using whole digit shifts then single precision shifts. This algorithm will also produce the remainder of the division
2481 algorithms of any multiple precision integer package. The set of multiplier algorithms include integer multiplication, squaring and modular reduction
2482 where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication
2488 35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision
2494 1962. This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach.
2502 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision
2503 multiplications are required. More specifically, for an $m$ and $n$ digit input $m \cdot n$ single precision multiplications are required. To
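The $m \cdot n$ count is visible in a schoolbook multiplication sketch over base-10 digits, least significant first (illustrative, not the text's s\_mp\_mul\_digs):

```c
/* Schoolbook (baseline) multiplication of an m-digit by an n-digit
 * number, beta = 10 for readability: exactly m * n single precision
 * multiplications, producing up to m + n digits in c. */
static void digit_mul(const int *a, int m, const int *b, int n, int *c)
{
    int i, j;
    for (i = 0; i < m + n; ++i)
        c[i] = 0;
    for (i = 0; i < m; ++i) {
        int carry = 0;
        for (j = 0; j < n; ++j) {
            int t = c[i + j] + a[i] * b[j] + carry;
            c[i + j] = t % 10;
            carry    = t / 10;
        }
        c[i + n] = carry;   /* this digit has not been written yet */
    }
}
```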
2556 This algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits. While it may seem
2559 Algorithm s\_mp\_mul\_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the
2593 Step 5.4.1 introduces the hat symbol (\textit{e.g. $\hat r$}) which represents a double precision variable. The multiplication on that step
2594 is assumed to be a double wide output single precision multiplication. That is, two single precision variables are multiplied to produce a
2595 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step
2596 5.4.1 is propagated through the nested loop. If the carry was not propagated immediately it would overflow the single precision digit
2601 exceed the precision requested.
2702 At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output
2703 independently. A serious obstacle is if the carry is lost, due to lack of precision before the algorithm has a chance to fix
2705 three single precision multiplications. If the precision of the accumulator for the output digits is less than $3 \cdot (\beta - 1)^2$ then
2710 from earlier that a double precision type has $\alpha$ bits of resolution and a single precision digit has $lg(\beta)$ bits of precision. Given these
2723 Let $\rho = lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is
2742 Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\
2776 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.
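A Comba-style sketch with 8-bit digits and a 32-bit accumulator (sizes chosen for illustration; as the text notes, the accumulator must be wide enough that a column's sum of products cannot overflow before the carry is fixed up):

```c
#include <stdint.h>

/* Comba-style column multiplication: each output digit is the sum of
 * all products a_i * b_j with i + j equal to the column index,
 * accumulated in a double precision variable, with the carry fixed up
 * once per column rather than once per product.  The output c must
 * have room for m + n digits. */
static void comba_mul(const uint8_t *a, int m, const uint8_t *b, int n,
                      uint8_t *c)
{
    uint32_t acc = 0;                      /* double precision column sum */
    for (int col = 0; col < m + n; ++col) {
        for (int i = 0; i < m; ++i) {
            int j = col - i;
            if (j >= 0 && j < n)
                acc += (uint32_t)a[i] * b[j];
        }
        c[col] = (uint8_t)(acc & 0xFF);    /* store the column digit */
        acc  >>= 8;                        /* carry into next column */
    }
}
```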
2891 The polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach. However,
2902 \item The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc. For example
3151 large numbers. For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with
3219 available but in fact there is. Consider the multiplication of $576$ against $241$. In total there will be nine single precision multiplications
3225 For any $n$-digit input, there are ${{\left (n^2 + n \right)}\over 2}$ possible unique single precision multiplications required compared to the $n^2$
3327 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
3483 Now if $5n$ single precision additions and a squaring of $n$-digits is faster than multiplying two $n$-digit numbers and doubling then
3486 Let $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or
3628 DSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types.
3632 The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed
3652 The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently as is the case with
3675 precision.
3677 Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and
3678 another $n^2$ single precision multiplications to find the residue. In total $3n^2$ single precision multiplications are required to
3686 Using the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications. If that were the best
3687 that could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$.
3688 See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of
3716 The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single
3717 precision multiplications, ignoring the subtractions required. In total $2m^2 + m$ single precision multiplications are required to find the residue.
3727 So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As
3732 half of the product. It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision
3736 The value of $\mu$ is a $m$-digit number and $q_0$ is a $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision
3738 of single precision multiplications to ${m^2 + m} \over 2$ single precision multiplications.
3749 only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications.
3751 With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which
3801 for the quotient to have enough precision. If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem.
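The shape of Barrett reduction, a precomputed $\mu$ followed by two multiplications and a short corrective loop, can be sketched at the single-word level with $\beta = 2^{16}$ (the text's digit-wise algorithm generalizes this; the function names are mine):

```c
#include <stdint.h>

#define BETA (1ULL << 16)

/* Precompute mu = floor(beta^2 / b); requires 0 < b < beta. */
static uint64_t barrett_mu(uint64_t b)
{
    return (BETA * BETA) / b;
}

/* Reduce x modulo b for x < beta^2: q approximates floor(x / b) from
 * below, off by at most a small constant, so at most a couple of
 * corrective subtractions are needed. */
static uint64_t barrett_reduce(uint64_t x, uint64_t b, uint64_t mu)
{
    uint64_t q = (x * mu) / (BETA * BETA);
    uint64_t r = x - q * b;
    while (r >= b)
        r -= b;
    return r;
}
```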
3807 instead of stopping at a given level of precision, it starts at a given level of precision. This optimal algorithm can only be used if the number
3826 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits
3937 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$. The current algorithm requires $2k^2$ single precision shifts
3938 and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
3961 precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement.
4111 Using a quick inspection this algorithm requires $n$ single precision multiplications for the outer loop and $n^2$ single precision multiplications
4112 in the inner loop. In total $n^2 + n$ single precision multiplications which compares favourably to Barrett at $n^2 + 2n - 1$ single precision
4130 The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction, however it is much slower due to the serial
4195 4.1 can be single precision only since $ab \mbox{ (mod }\beta\mbox{)} \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta)$. Some multipliers such
4197 a single precision multiplication instead, half the amount of time is spent.
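The structure of Montgomery reduction can be sketched at the single-word level with $R = 2^{32}$: a precomputed $\rho = -n^{-1} \mbox{ (mod }2^{32}\mbox{)}$, one low-half multiplication, one full multiplication and a shift (illustrative only; the text's algorithm runs digit by digit, and the $2^{31}$ bound below is my own simplification to avoid overflow in the 64-bit addition):

```c
#include <stdint.h>

/* Precompute rho = -n^{-1} mod 2^32 for odd n, via Newton iteration:
 * x <- x * (2 - n * x) doubles the number of correct low bits. */
static uint32_t mont_rho(uint32_t n)
{
    uint32_t x = n;                 /* correct to 3 bits for odd n */
    for (int i = 0; i < 4; ++i)
        x *= 2 - n * x;             /* now n^{-1} mod 2^32         */
    return (uint32_t)(0u - x);
}

/* Montgomery reduction: for odd n < 2^31 and x < n * 2^32, returns
 * x * 2^{-32} mod n.  The addition below is exactly divisible by
 * 2^32 by construction of m. */
static uint32_t mont_reduce(uint64_t x, uint32_t n, uint32_t rho)
{
    uint32_t m = (uint32_t)x * rho;            /* mod 2^32 multiply */
    uint64_t t = (x + (uint64_t)m * n) >> 32;  /* t < 2n            */
    return (uint32_t)(t >= n ? t - n : t);
}
```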
4218 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line 110 fixes the carry
4273 then an x86 multiplier could produce the 62-bit product and use the ``shrd'' instruction to perform a double-precision right shift. The proof
4681 calling the half precision multipliers, addition and division by $\beta$ algorithms.
5245 This chapter discusses the various higher level algorithms that are required to complete a well rounded multiple precision integer package. These
5504 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$. In this case $k$ is the number of bits in the leading digit which is