lib/tommath/tommath.tex

172 Further multiplications of say $3$ result in a larger precision result $126$.  In these few examples we have multiple
214 various cryptographic journals, can render algorithms such as RSA and Diffie-Hellman more efficient.  In fact, several
228 In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}.
259 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts.  For
272 to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field
316 a single precision integer type, while, the data type \textbf{mp\_word} will represent a double precision integer type.  In
328 In this particular case, $\hat c = 1127$ and $c = 127$.  The most significant digit of the product would not fit
349 To measure the efficiency of the specified algorithms, a modified big-Oh notation is used.  In this system all
356 baseline squaring (section \ref{sec:basesquare}) requires $O({{n^2 + n}\over 2})$ work.  In standard big-Oh notation these
454 be source compatible with another popular library which makes it more attractive for developers to use.  In this case the
494 are signaled to the host application.  This happens to be the fastest approach but definitely not the most versatile.  In
506 exponentiation.  In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually
526 a problem along with allowable solution parameters should be identified and analyzed.  In this particular case the
567 As a well known analogy, school children are taught how to form numbers larger than nine by prepending more radix ten digits.  In the decimal system
626 first starting at the location indexed by zero\footnote{In C all arrays begin at zero.} in the array.  For example,
654 In LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int
668 of assignment expressions.  That is, the destination (output) is on the left and arguments (inputs) are on the right.  In
669 truth, it is entirely a matter of preference.  In the case of LibTomMath the convention from the MPI library has been
684 In the case of LibTomMath the only errors that are checked for are related to inappropriate inputs (division by zero for
782 a subsequent expression (or body of expressions) are to be evaluated upto $c - b$ times so long as $b \le c$.  In each
804 In order to assure the mp\_int is in a known state the digits must be set to zero.  On most platforms this could have been
866 a standard C library function.  In this case the free() function.  Since free() only deallocates the memory the pointer
1058 will not contribute to the precision of the result.  In fact, through subsequent operations more leading zero digits would
1096 Note on line 28 how to test for the \textbf{used} count is made on the left of the \&\& operator.  In the C programming
1125 In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining
1190 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 30 to 33).  In order to
1220 In this case an alias is used to access the
1247 finally the carries are propagated.  In this example the middle column production phase will typically be nested as it
1477 mp\_int.  Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions.  In step 2.2 the
1590 comparison code.  When the signs are equal the digits of the inputs must be compared to determine the correct result.  In step
1637 In common twos complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus.  For example, with 32-bit integers
1758 For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$.  In ISO C an ``unsigned long''
1813 third bit of $0101_2$ is subtracted from the result it will cause another carry.  In this case though the carry will be forced to propagate all the
1832 (lines 25 and 26).  In reality the $min$ and $max$ variables are only aliases and are only
2014 In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established.  That is to shift
2020 In a binary system where the radix is a power of two multiplication by two not only arises often in other algorithms it is a fairly efficient
2143 degree.  In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + ... + a_0 x$.  From a scalar basis point of view multiplying by $x$ is equivalent to
2184 The loop on step 7 copies the digit from the tail to the head.  In each iteration the window is moved down one digit.   The last loop on
2620 digits as output.  In each iteration of the outer loop the $pb$ variable is set (line 49) to the maximum
2626 is required for the product.  In x86 terms for example, this means using the MUL instruction.
2641 twist is placed on how the columns of the result are produced.  In the standard long-hand algorithm rows of products
2642 are produced then added together to form the final result.  In the baseline algorithm the columns are added together
2645 In the Comba algorithm the columns of the result are produced entirely independently of each other.  That is at
2697 With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$.  In this case
2730 The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 2^{64}$ which means that $k$ is bounded by $k < 257$.  In this configuration
2787 means we have to scan one integer upwards and the other downwards.  $a$ starts at $tx$ and $b$ starts at $ty$.  In each
2824 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication.  In the following algorithms
2826 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required.  In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
2942 of this system of equations has made Karatsuba fairly popular.  In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
2993 In order to split the two inputs into their respective halves, a suitable \textit{radix point} must be chosen.  The radix point chosen must
3124 algorithm has a lower asymptotic running time of approximately $O(n^{1.464})$ but at an obvious cost in overhead.  In this
3219 available but in fact there is.  Consider the multiplication of $576$ against $241$.  In total there will be nine single precision multiplications
3245 appear twice hence the name ``double product''.  Every odd column is made up entirely of double products.  In fact every column is made up of double
3338 carry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level.  In this case, we have an even simpler solution in mind.
3577 $\left [ 2 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\
3606 $r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$.  In other vernacular $r$ is known as the
3637 In this system a $k$-bit integer $n$ would actually represent $n/2^q$.  For example, with $q = 4$ the integer $n = 37$ would actually represent the
3659 larger than the dividend.  In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach
3673 Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one.  In the context of Barrett
3678 another $n^2$ single precision multiplications to find the residue.  In total $3n^2$ single precision multiplications are required to
3717 precision multiplications, ignoring the subtractions required.  In total $2m^2 + m$ single precision multiplications are required to find the residue.
3721 represent the value of which the residue is desired.  In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$.
3734 In fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed.
3827 in the modulus.  In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is
3831 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance.  Ideally this value should be computed once and stored for
3989 loop.  Note that for the iterations $t = 2, 5, 6$ and $8$ where the result $x$ is not changed.  In those iterations the $t$'th bit of $x$ is
4026 In each iteration of the loop on step 1 a new value of $\mu$ must be calculated.  The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used
4047 the algorithm.  To get the true residue the value must be multiplied by $\beta^k$.  In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and
4112 in the inner loop.  In total $n^2 + n$ single precision multiplications which compares favourably to Barrett at $n^2 + 2n - 1$ single precision
4200 4.3 will do.  In effect over the $n.used$ iterations of the outer loop the $n.used$'th lower columns all have the their carries propagated forwards.  Note
4334 sum in step 4 will exceed $n - k$.  In practice fewer than three passes of the algorithm are required to reduce virtually every input in the
4369 is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast.  In this case only
4383 performed by moving whole digits to the right $p$ places.  In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$.
4454 the algorithm will resume if further reduction passes are required.  In theory it could be placed at the top of the function however, the size of
4535 In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead.  However, this new
4679 In theory Montgomery and Barrett reductions would require roughly the same amount of time to complete.  However, in practice since Montgomery
4684 shines are based on the discrete logarithm problem such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}.  In these algorithms
4758 multiplied against the current product.  In each iteration the product is squared which doubles the exponent of the individual terms of the
4821 on step 3.1.  In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$.  The value
4822 of $b$ is shifted left one bit to make the next bit down from the most signficant bit the new most significant bit.  In effect each
4963 a single squaring took place instead of a squaring and multiplication.  In total the first method requires $10$ multiplications and $18$
4966 In general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster.
5019 In order to keep the algorithms in a known state the first step on line 29 is to reject any negative modulus as input.  If the exponent is
5153          $1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit.  In this case bits are ignored until a non-zero bit is found.
5154    \item When $mode = 1$ a non-zero bit has been seen before and a new $winsize$-bit window has not been formed yet.  In this mode leading $0$ bits
5173 upwards.  In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to
5291 digits a much simpler division may be used to form an educated guess at what the value must be.  In this case $k = \lfloor 54/23\rfloor = 2$ quickly
5461 $2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means.  In this case $y$ is shifted
5481 outside their respective boundaries.  For example, if $t = 0$ or $i \le 1$ then the digits would be undefined.  In those cases the digits should
5491 The implementation of this algorithm differs slightly from the pseudo code presented previously.  In this algorithm either of the quotient $c$ or
5504 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$.  In this case $k$ is the number of bits in the leading digit which is
5606 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
5651 a multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup.  In essence it is much like the Barrett reduction
5678 In this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired.  The derivative of $f(x)$ is
5738 factoring for example, can make use of random values as starting points to find factors of a composite integer.  In this case the algorithm presented
5959 In particular, we would like $a - b$ to decrease in magnitude which implies that $b \ge a$.
5982 The algorithm in figure~\ref{fig:gcd2} will eventually terminate since $b \ge a$ the subtraction in step 1.2 will be a value less than $b$.  In other
5989 not divide the greatest common divisor but will divide $b - a$.  In this case ${b - a} \over p$ is also an integer and still divisible by
6031 In particular the value of $p$ should be chosen such that the division on step 5.3.1 occur often.  It also helps that division by $p$ be easy
6080 Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain.  In theory it achieves the same asymptotic working time as
6328 bit of $k$ is required, however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same
6426 The astute reader will observe that steps seven through nine are very similar to the binary greatest common divisor algorithm mp\_gcd.  In this case
6447 When the modulus $b$ is odd the variables $A$ and $C$ are fixed and are not required to compute the inverse.  In particular by attempting to solve
6481 be of any practical use.  In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate
6535 If $n$ is composite then any given base $a$ does not have to have a period which divides $n - 1$.  In which case