
Lines Matching defs:algorithm

209 rendering any protocol based on the algorithm insecure.  Multiple precision algorithms solve this very problem by 
213 primitives. Faster modular reduction and exponentiation algorithms such as Barrett's algorithm, which have appeared in
235 precision algorithm would augment the precision of the destination to accommodate the result while a single precision system
259 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts. For
260 example, the Handbook of Applied Cryptography (\textit{HAC}), algorithm 14.7 on page 594, gives a relatively simple
261 algorithm for performing multiple precision integer addition. However, the description lacks any discussion concerning
263 as the text would lead people to believe. Similarly the division routine (\textit{algorithm 14.20, pp. 598}) does not
277 by the actual C source code that implements the algorithm. The pseudo-code can be used to implement the same
278 algorithm in other programming languages as the reader sees fit.
294 synonymous. When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members
300 For certain discussions more generic algorithms are presented to help the reader understand the final algorithm used
301 to solve a given problem. When an algorithm is described as accepting an integer input it is assumed the input is
305 precision algorithm to solve the same problem.
315 Within the source code that will be presented for each algorithm, the data type \textbf{mp\_digit} will represent
332 Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision
359 result small constant factors in the work effort will make an observable difference in algorithm efficiency.
408 are also designed to be easy but will require a program or algorithm to be implemented to arrive at the answer. These
413 involve devising a new algorithm or implementing a variation of another algorithm previously presented. Readers who can
442 even though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the
450 algorithm \textbf{mp\_mul()} which will automatically use Toom--Cook, Karatsuba, Comba or baseline multiplication
461 integer arithmetic. To this end the source code has been given quite a few comments and algorithm discussion points.
532 before implementing a modular exponentiation algorithm one would implement a modular reduction algorithm.
559 This chapter discusses the core algorithms of the library which are the dependents for every other algorithm.
736 structure are set to valid values. The mp\_init algorithm will perform such an action.
810 (lines 34 through 35) to their respective default states. At this point the algorithm has succeeded and
816 returned to the application's memory pool with the mp\_clear algorithm.
840 This algorithm accomplishes two goals. First, it clears the digits and the other mp\_int members. This ensures that
844 The logic behind the algorithm is extended by marking cleared mp\_int structures so that subsequent calls to this
845 algorithm will not try to free the memory multiple times. Cleared mp\_ints are detectable by having a pre-defined invalid
848 Once an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm
858 The algorithm only operates on the mp\_int if it hasn't been previously cleared. The if statement (line 25)
885 must be re-sized appropriately to accommodate the result. The mp\_grow algorithm will provide this functionality.
943 of input mp\_ints to a given algorithm. The purpose of algorithm mp\_init\_size is similar to mp\_init except that it
971 This algorithm will initialize an mp\_int structure $a$ like algorithm mp\_init with the exception that the number of
976 Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero. This
977 particular algorithm is useful if the approximate size of the input is known ahead of time. If the approximation is
1000 The purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single
1024 The algorithm will initialize the array of mp\_int variables one at a time. If a runtime error has been detected
1042 the algorithm can backtrack and free the previously initialized structures (lines 28 to 47).
1062 The mp\_clamp algorithm is designed to solve this very problem. It will trim high-order zeros by decrementing the
1085 As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at
1111 $\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp. What does it prevent? \\
1113 $\left [ 1 \right ]$ & Give an example of when the algorithm mp\_init\_copy might be useful. \\
1127 level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they
1138 value as the mp\_int it was copied from. The mp\_copy algorithm provides this functionality.
1162 This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will
1167 algorithm. The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (step two
1171 \textbf{Remark.} This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the
1173 step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is
1174 limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
1185 Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output
1190 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 30 to 33). In order to
1230 The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for''
1258 mp\_init\_copy algorithm has been designed to help perform this task.
1277 This algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it. As
1278 such this algorithm will perform two operations in one step.
1292 Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be the algorithm used to
1313 This algorithm simply resets an mp\_int to the default state.
1327 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
1348 This algorithm computes the absolute of an mp\_int input. First it copies $a$ over $b$. This is an example of an
1349 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows,
1360 This fairly trivial algorithm first eliminates non--required duplications (line 28) and then sets the
1364 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute
1389 This algorithm computes the negation of an input. First it copies $a$ over $b$. If $a$ has no used digits then
1390 the algorithm returns immediately. Otherwise it flips the sign flag and stores the result in $b$. Note that if
1391 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return
1407 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
1429 This algorithm sets an mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The
1452 To overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal. It accepts a ``long''
1476 The algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the
1499 Comparing a multiple precision integer is performed with the exact same algorithm used to compare two decimal numbers. For example,
1549 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
1567 comparison a trivial signed comparison algorithm can be written.
1608 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
1610 $\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits \\
1649 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the
1650 trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have a ``s\_'' prefix.
1697 This algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have different magnitudes.
1698 Coincidentally the description of algorithm A in Knuth \cite[pp. 266]{TAOCPV2} shares the same deficiency as the algorithm from \cite{HAC}. Even the
1747 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the
1748 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
1749 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
1750 This algorithm as will be shown can be used to create functional signed addition and subtraction algorithms.
1753 For this algorithm a new variable is required to make the description simpler. Recall from section 1.3.1 that an mp\_digit must be able to represent
1755 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a
1797 This algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive. That is when
1798 passing variables $a$ and $b$ the condition that $\vert a \vert \ge \vert b \vert$ must be met for the algorithm to function correctly. This
1799 algorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well. As was the case
1800 of the algorithm s\_mp\_add both other references lack discussion concerning various practical details such as when the inputs differ in magnitude.
1802 The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have at least the same magnitude as $b$. Steps 1 and 2
1804 most $max$ digits in length as opposed to $max + 1$. Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and
1807 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision
1834 within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized
1850 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be
1851 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data
1882 This algorithm performs the signed addition of two mp\_int variables. There is no reference algorithm to draw upon from
1883 either \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations. The algorithm is fairly
1915 forwarded to step three to check for errors. This simplifies the description of the algorithm considerably and best
1919 s\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits. The mp\_clamp algorithm will set the \textbf{sign}
1922 For example, consider performing $-a + a$ with algorithm mp\_add. By the description of the algorithm the sign is set to \textbf{MP\_NEG} which would
1923 produce a result of $-0$. However, since the sign is set first then the unsigned addition is performed the subsequent usage of algorithm mp\_clamp
1924 within algorithm s\_mp\_add will force $-0$ to become $0$.
1933 The source code follows the algorithm fairly closely. The most notable new source code addition is the usage of the $res$ integer variable which
1934 is used to pass the result of the unsigned operations forward. Unlike in the algorithm, the variable $res$ is merely returned as is without
1935 explicitly checking it and returning the constant \textbf{MP\_OKAY}. The observation is this algorithm will succeed or fail only if the lower
1939 The high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm.
1969 This algorithm performs the signed subtraction of two inputs. Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or
1970 \cite{HAC}. Also this algorithm is restricted by algorithm s\_mp\_sub. Chart \ref{fig:SubChart} lists the eight possible inputs and
1996 Similar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction. That is to prevent the
1997 algorithm from producing $-a - -a = -0$ as a result.
2006 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
2055 This algorithm will quickly multiply an mp\_int by two provided $\beta$ is a power of two. Neither \cite{TAOCPV2} nor \cite{HAC} describe such
2056 an algorithm despite the fact it arises often in other algorithms. The algorithm is set up much like the lower level algorithm s\_mp\_add since
2115 This algorithm will divide an mp\_int by two using logical shifts to the right. Like mp\_mul\_2 it uses a modified low level addition
2116 core as the basis of the algorithm. Unlike mp\_mul\_2 the shift operations work from the leading digit to the trailing digit. The algorithm
2175 This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs
2178 different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is
2179 typically used on values where the original value is no longer required. The algorithm will return success immediately if
2180 $b \le 0$ since the rest of the algorithm is only valid when $b > 0$.
2243 This algorithm divides the input in place by the $b$'th power of $x$. It is analogous to dividing by a $\beta^b$ but much quicker since
2244 it does not require single precision division. This algorithm does not actually return an error code as it cannot fail.
2246 If the input $b$ is less than one the algorithm quickly returns without performing any work. If the \textbf{used} count is less than or equal
2249 After the trivial cases of inputs have been handled the sliding window is set up. Much like the case of algorithm mp\_lshd a sliding window that
2269 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single
2308 This algorithm multiplies $a$ by $2^b$ and stores the result in $c$. The algorithm uses algorithm mp\_lshd and a derivative of algorithm mp\_mul\_2 to
2311 First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which will ensure that the remainder multiplicand is less than
2317 Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le x < lg(\beta)$. The $mask$
2320 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits that it takes $2n$ ``time'' to
2321 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
2376 This algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder. The algorithm is designed much like algorithm
2377 mp\_mul\_2d by first using whole digit shifts then single precision shifts. This algorithm will also produce the remainder of the division
2378 by using algorithm mp\_mod\_2d.
2387 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally
2397 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This
2398 algorithm benefits from the fact that in two's complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
2430 This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$. First if $b$ is less than or equal to zero the
2451 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
2454 $\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low hamming \\
2458 $\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\
2462 & algorithm to multiply two integers in roughly $O(2n^2)$ time for \\
2466 $\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\
2493 overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. First Karatsuba multiplication was discovered in
2494 1962. This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach.
2502 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision
2506 The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be
2507 used. This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible. One important
2508 facet of this algorithm, is that it has been modified to only produce a certain amount of output digits as resolution. The importance of this
2510 will be at most $n + m$ digits. Therefore, this algorithm can be reduced to a full multiplier by having it produce $n + m$ digits of the product.
2525 \hspace{3mm}1.1 Calculate $c = \vert a \vert \cdot \vert b \vert$ by the Comba method (\textit{see algorithm~\ref{fig:COMBAMULT}}). \\
2556 This algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits. While it may seem
2558 algorithm. The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}.
2562 The first thing this algorithm checks for is whether a Comba multiplier can be used instead. If the minimum digit count of either
2563 input is less than $\delta$, then the Comba method may be used instead. After the Comba method is ruled out, the baseline algorithm begins. A
2564 temporary mp\_int variable $t$ is used to hold the intermediate result of the product. This allows the algorithm to be used to
2568 is given the count of digits to read from $b$ inside the nested loop. If $pb \le 1$ then no more output digits can be produced and the algorithm
2595 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step
2640 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight
2641 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products
2642 are produced then added together to form the final result. In the baseline algorithm the columns are added together
2645 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at
2647 after the nested loop to reduce the amount of work required. Succinctly the first step of the algorithm is to compute
2697 With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$. In this case
2698 $241 \cdot 576$ is in fact $138816$ and the procedure succeeded. If the algorithm is correct and as will be demonstrated shortly more
2699 efficient than the baseline algorithm why not simply always use this algorithm?
2703 independently. A serious obstacle is if the carry is lost, due to lack of precision before the algorithm has a chance to fix
2709 The maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used. Recall
2776 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.
2778 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the
2798 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
2817 slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the
2862 As a general rule of the algorithm when the inputs are split into $n$ parts each there are $2n - 1$ multiplications. Each multiplication is of
2863 multiplicands that have $n$ times fewer digits than the inputs. The asymptotic running time of this algorithm is
2887 of solving for the 2001 terms of $W(x)$ will certainly consume any savings the algorithm could offer for all but exceedingly large
2929 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns
2943 making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman.
2989 This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description
3000 of an additional temporary variable, the algorithm can avoid an additional memory allocation operation.
3021 The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp\_mod\_2d and mp\_rshd
3029 When line 151 is reached, the algorithm has completed successfully. The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that
3033 Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are
3052 the algorithm can be faster than a baseline multiplication. However, the greater complexity of this algorithm places the cutoff point
3123 This algorithm computes the product of two mp\_int variables $a$ and $b$ using the Toom-Cook approach. Compared to the Karatsuba multiplication, this
3124 algorithm has a lower asymptotic running time of approximately $O(n^{1.464})$ but at an obvious cost in overhead. In this
3133 $f(y)$ and $g(y)$ which significantly speeds up the algorithm.
3150 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very
3153 algorithm is not practical as Karatsuba has a much lower cutoff point.
3167 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established.
3182 \hspace{3mm}3.1 $c \leftarrow a \cdot b$ using algorithm mp\_toom\_mul \\
3184 \hspace{3mm}4.1 $c \leftarrow a \cdot b$ using algorithm mp\_karatsuba\_mul \\
3188 \hspace{6mm}5.2.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm fast\_s\_mp\_mul\_digs. \\
3190 \hspace{6mm}5.3.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm s\_mp\_mul\_digs. \\
3201 This algorithm performs the signed multiplication of two inputs. It will make use of any of the three unsigned multiplication algorithms
3202 available when the input is of appropriate size. The \textbf{sign} of the result is not set until the end of the algorithm since algorithm
3253 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
3295 This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC
3296 \cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the
3299 The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
3304 very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
3307 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
3308 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm.
3323 get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to
3384 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm
3386 This algorithm is very similar to the Comba multiplier except with a few key differences we shall make note of.
3409 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
3423 Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of
3426 If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm
3432 The 100 digit halves will not be squared using Karatsuba, but instead using the faster Comba based squaring algorithm. If Karatsuba multiplication
3475 This algorithm computes the square of an input $a$ using the Karatsuba technique. This algorithm is very similar to the Karatsuba based
3476 multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings.
3514 This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul. It uses the same inline style to copy and
3524 redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
3528 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
3529 instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
3530 derive their own Toom-Cook squaring algorithm.
3542 \hspace{3mm}1.1 $b \leftarrow a^2$ using algorithm mp\_toom\_sqr \\
3544 \hspace{3mm}2.1 $b \leftarrow a^2$ using algorithm mp\_karatsuba\_sqr \\
3548 \hspace{6mm}3.2.1 $b \leftarrow a^2$ using algorithm fast\_s\_mp\_sqr. \\
3550 \hspace{6mm}3.3.1 $b \leftarrow a^2$ using algorithm s\_mp\_sqr. \\
3561 This algorithm computes the square of the input using one of four different algorithms. If the input is very large and has at least
3562 \textbf{TOOM\_SQR\_CUTOFF} or \textbf{KARATSUBA\_SQR\_CUTOFF} digits then either the Toom-Cook or the Karatsuba Squaring algorithm is used. If
3563 neither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used.
3574 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
3619 The Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate
3629 One further common optimization is required to speed up the algorithm.
3673 Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett
3677 Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and
3727 So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As
3728 it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for
3733 multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly.
3741 After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small
3747 $O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed. Similarly the value of $a$ can
3751 With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which
3795 This algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm. It is loosely based on algorithm 14.42 of HAC
3796 \cite[pp. 602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}. The algorithm has several restrictions and assumptions which must
3797 be adhered to for the algorithm to work.
3802 Technically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish. The value of $\mu$ is passed as an argument to this
3803 algorithm and is assumed to be calculated and stored before the algorithm is used.
3805 Recall that the multiplication for the quotient on step 3 must only produce digits at or above the $m-1$'th position. An algorithm called
3806 $s\_mp\_mul\_high\_digs$ which has not been presented is used to accomplish this task. The algorithm is based on $s\_mp\_mul\_digs$ except that
3807 instead of stopping at a given level of precision it starts at a given level of precision. This optimal algorithm can only be used if the number
3815 The while loop at step 9 will subtract $b$ until the residue is less than $b$. If the algorithm is performed correctly this step is
3827 in the modulus. In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is
3831 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for
3832 future use so that the Barrett algorithm can be used without delay.
3853 This algorithm computes the reciprocal $\mu$ required for Barrett reduction. First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot m}$ which
3863 This simple routine calculates the reciprocal $\mu$ required by Barrett reduction. Note the extended usage of algorithm mp\_div where the variable
3868 Montgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting
3870 residue times a constant. However, as perplexing as this may sound the algorithm is relatively simple and very efficient.
3873 $n$ must be odd. The variable $x$ will represent the quantity of which the residue is sought. Similar to the Barrett algorithm the input
3883 From these two simple facts the following simple algorithm can be derived.
3905 The algorithm reduces the input one bit at a time using the two congruencies stated previously. Inside the loop $n$, which is odd, is
3908 final result of the Montgomery algorithm. If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to
3934 the algorithm $r = 178$ is congruent to the value of $2^{-9} \cdot 5555 \mbox{ (mod }257\mbox{)}$. When $r$ is multiplied by $2^9$ modulo $257$ the correct residue
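The two congruences translate almost directly into code. A word-sized sketch (not the library's routine; it assumes $n$ is small enough that $x + n$ cannot overflow a 32-bit word) reduces one bit per iteration:

```c
#include <stdint.h>
#include <assert.h>

/* Bit-at-a-time Montgomery reduction sketch: for odd n this returns a value
 * congruent to x * 2^{-k} (mod n).  When x is odd, adding n leaves the
 * residue unchanged and makes x even, so the low bit can always be shifted
 * out exactly.  Assumes x < n^2 and n < 2^15 so x + n fits in 32 bits. */
uint32_t mont_reduce_bits(uint32_t x, uint32_t n, int k)
{
    for (int i = 0; i < k; i++) {
        if (x & 1)
            x += n;     /* x + n == x (mod n), and the sum is even */
        x >>= 1;        /* exact division by two */
    }
    return x;
}
```

Running it on the worked example, $x = 5555$, $n = 257$ and $k = 9$ produces $178$, and indeed $178 \cdot 2^9 \equiv 5555 \mbox{ (mod }257\mbox{)}$.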
3937 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$. The current algorithm requires $2k^2$ single precision shifts
3938 and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
3939 Fortunately there exists an alternative representation of the algorithm.
3960 This algorithm is equivalent since $2^tn$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2. The number of single
3987 Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 9$.
3988 With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the
3994 previous algorithm re-written to compute the Montgomery reduction in this new fashion.
4027 extensively in this algorithm and should be precomputed. Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$.
4047 the algorithm. To get the true residue the value must be multiplied by $\beta^k$. In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and
4051 The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for
4065 \hspace{3mm}2.1 Use algorithm fast\_mp\_montgomery\_reduce instead. \\
4098 This algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm. The algorithm is loosely based
4099 on algorithm 14.32 of \cite[pp.601]{HAC} except it merges the multiplication of $\mu n \beta^t$ with the addition in the inner loop. The
4100 restrictions on this algorithm are fairly easy to adapt to. First $0 \le x < n^2$ bounds the input to numbers in the same range as
4101 for the Barrett algorithm. Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$. $\rho$ must be calculated in
4102 advance of this algorithm. Finally the variable $k$ is fixed and a pseudonym for $n.used$.
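The heart of the merged inner loop can be shown at single precision. In this sketch (illustrative names, $\beta = 2^{32}$, and $n < 2^{31}$ assumed so the intermediate sum cannot overflow), $\mu = x\rho \mbox{ mod } \beta$ is chosen precisely so that adding $\mu n$ zeroes the bottom word:

```c
#include <stdint.h>
#include <assert.h>

/* Single-precision Montgomery reduction with beta = 2^32.  Given odd n with
 * n < 2^31, rho = -1/n mod beta and 0 <= x < n^2, returns x * beta^{-1}
 * (mod n).  The mp_int algorithm repeats this once per digit of n. */
uint32_t mont_reduce_word(uint64_t x, uint32_t n, uint32_t rho)
{
    uint32_t mu = (uint32_t)x * rho;  /* mu = x * rho mod beta */
    x += (uint64_t)mu * n;            /* low word of x is now zero ... */
    x >>= 32;                         /* ... so division by beta is exact */
    if (x >= n)                       /* result < 2n; one subtraction */
        x -= n;
    return (uint32_t)x;
}
```

A convenient check value: for $n = 2^{31} - 1$ we have $n^2 \equiv 1 \mbox{ (mod }2^{32}\mbox{)}$, so $\rho = -n^{-1} = 2^{31} + 1$, and the result $r$ always satisfies $r \cdot 2^{32} \equiv x \mbox{ (mod }n\mbox{)}$.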
4104 Step 2 decides whether a faster Montgomery algorithm can be used. It is based on the Comba technique meaning that there are limits on
4105 the size of the input. This algorithm is discussed in sub-section 6.3.3.
4107 Step 5 is the main reduction loop of the algorithm. The value of $\mu$ is calculated once per iteration in the outer loop. The inner loop
4111 Using a quick inspection this algorithm requires $n$ single precision multiplications for the outer loop and $n^2$ single precision multiplications
4122 This is the baseline implementation of the Montgomery reduction algorithm. Lines 31 to 36 determine if the Comba based
4131 nature of the inner loop. The Barrett reduction algorithm requires two slightly modified multipliers which can be implemented with the Comba
4132 technique. The Montgomery reduction algorithm cannot directly use the Comba technique to any significant advantage since the inner loop calculates
4139 With this change in place the Montgomery reduction algorithm can be performed with a Comba style multiplication loop which substantially increases
4140 the speed of the algorithm.
4187 This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly
4188 faster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}). The algorithm has the same restrictions
4189 on the input as the baseline reduction algorithm. An additional two restrictions are imposed on this algorithm. The number of digits $k$ in the
4190 the modulus $n$ must not violate $MP\_WARRAY > 2k +1$ and $n < \delta$. When $\beta = 2^{28}$ this algorithm can be used to reduce modulo
4226 To calculate the variable $\rho$ a relatively simple algorithm will be required.
4251 This algorithm will calculate the value of $\rho$ required within the Montgomery reduction algorithms. It uses a very interesting trick
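The trick can be sketched in a few lines of C for $\beta = 2^{32}$. Any odd $n_0$ satisfies $n_0^2 \equiv 1 \mbox{ (mod }8\mbox{)}$, so $x = n_0$ is already its own inverse to three bits, and each Newton step $x \leftarrow x(2 - n_0 x)$ doubles the number of correct bits:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch of the mp_montgomery_setup trick: compute rho = -1/n0 mod 2^32
 * for odd n0 without a single division.  If n0*x == 1 (mod 2^k) then
 * n0*x(2 - n0*x) == 1 (mod 2^{2k}), so precision doubles each step. */
uint32_t montgomery_setup(uint32_t n0)
{
    uint32_t x = n0;          /* correct to 3 bits: odd^2 == 1 (mod 8) */
    x *= 2 - n0 * x;          /* correct to 6 bits  */
    x *= 2 - n0 * x;          /* correct to 12 bits */
    x *= 2 - n0 * x;          /* correct to 24 bits */
    x *= 2 - n0 * x;          /* correct to 48 >= 32 bits */
    return (uint32_t)0 - x;   /* rho = -1/n0 mod 2^32 */
}
```

All arithmetic is performed modulo $2^{32}$ simply by letting the 32-bit words wrap, which is why no explicit reductions appear.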
4287 into the equation the original congruence is reproduced, thus concluding the proof. The following algorithm is based on this observation.
4313 This algorithm will reduce $x$ modulo $n - k$ and return the residue. If $0 \le x < (n - k)^2$ then the algorithm will loop almost always
4334 sum in step 4 will exceed $n - k$. In practice fewer than three passes of the algorithm are required to reduce virtually every input in the
4369 is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only
4374 On the surface this looks like a very expensive algorithm. It requires a couple of subtractions followed by a multiplication and other
4375 modular reductions. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen.
4393 as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be.
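The folding step is easy to demonstrate at fixed precision. This sketch (illustrative, not mp\_dr\_reduce) uses $\beta = 2^{16}$, $m = 2$ and a small $k$, so $n = \beta^m - k = 2^{32} - k$ and $\beta^m \equiv k \mbox{ (mod }n\mbox{)}$:

```c
#include <stdint.h>
#include <assert.h>

/* Diminished Radix reduction sketch for n = 2^32 - k with small k.
 * Since 2^32 == k (mod n), the digits above beta^m fold into the lower
 * ones with a single small multiplication; at most one final subtraction
 * remains because the loop exits with x < 2^32 = n + k < 2n. */
uint64_t dr_reduce(uint64_t x, uint32_t k)
{
    const uint64_t n = ((uint64_t)1 << 32) - k;
    while (x >> 32)                              /* digits above beta^m left */
        x = (x & 0xFFFFFFFFu) + (x >> 32) * k;   /* x mod 2^32 + q*k */
    if (x >= n)
        x -= n;
    return x;
}
```

For inputs below $n^2$ the loop body runs only a couple of times, matching the claim that only a few passes are ever required.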
4396 The restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$. This algorithm can reduce
4397 an input $x$ within the range $0 \le x < n^2$ using only a couple passes of the algorithm demonstrated in figure~\ref{fig:DR}. The implementation
4398 of this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition
4399 of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular
4434 This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$. It has similar restrictions to that of the Barrett reduction
4437 This algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization. The division by $\beta^m$, multiplication by $k$
4443 At step 8 if $x$ is still larger than $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes
4454 the algorithm will resume if further reduction passes are required. In theory it could be placed at the top of the function; however, the size of
4458 a division by $\beta^m$ can be simulated virtually for free. The loop on line 64 performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11})
4459 in this algorithm.
4464 Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required.
4466 as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code
4470 To setup the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not really complicated but provided for
4497 Another algorithm which will be useful is the ability to detect a restricted Diminished Radix modulus. An integer is said to be
4520 This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are
4521 in the mp\_int. Step 2 tests all but the first digit to see if they are equal to $\beta - 1$. If the algorithm manages to get to
4532 The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$. This algorithm
4533 is a straightforward adaptation of algorithm~\ref{fig:DR}.
4535 In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead. However, this new
4536 algorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form.
4564 This algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$. Division by $2^p$ is emulated with a right
4565 shift which makes the algorithm fairly inexpensive to use.
4574 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d
4583 To setup this reduction algorithm the value of $k = 2^p - n$ is required.
4606 This algorithm computes the value of $k$ required for the algorithm mp\_reduce\_2k. By making a temporary variable $x$ equal to $2^p$ a subtraction
4651 This algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly.
4683 For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly
4685 primes of the form $\beta^m - k$ can be found and shared amongst users. These primes will allow the Diminished Radix algorithm to be used in
4692 $\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\
4695 $\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly. \\
4697 $\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\
4711 A trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired. However, as $b$ grows in size
4715 Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which
4734 be computed in an auxiliary variable. Consider the following equivalent algorithm.
4757 This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is
4761 For example, let $b = 101100_2 \equiv 44_{10}$. The following chart demonstrates the actions of the algorithm.
4780 When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal $a^{44}$ which is the desired exponentiation. This particular algorithm is
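The chart's actions condense into a very short routine. This sketch works on plain 64-bit integers (so the result must fit in a word; the book's algorithms do the same scan over an mp\_int):

```c
#include <stdint.h>
#include <assert.h>

/* Left-to-right binary exponentiation: scan the exponent from the most
 * significant bit down, squaring at every step and multiplying by the
 * base whenever the current bit is set.  The result must fit in 64 bits. */
uint64_t expt(uint64_t a, uint32_t b)
{
    uint64_t r = 1;
    for (int i = 31; i >= 0; i--) {
        r *= r;               /* always square */
        if ((b >> i) & 1)
            r *= a;           /* multiply when bit i of b is set */
    }
    return r;
}
```

With $b = 101100_2 = 44$ the multiplications fire at bits $5$, $3$ and $2$, accumulating $a^{32} \cdot a^8 \cdot a^4 = a^{44}$ exactly as in the chart.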
4784 The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended
4813 This algorithm computes the value of $a$ raised to the power of a single digit $b$. It uses the left to right exponentiation algorithm to
4814 quickly compute the exponentiation. It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the
4839 slower than squaring. Recall from the previous algorithm that $b_{i}$ refers to the $i$'th bit of the exponent $b$. Suppose instead it referred to
4840 the $i$'th $k$-bit digit of the exponent of $b$. For $k = 1$ the definitions are synonymous and for $k > 1$ algorithm~\ref{fig:KARY}
4842 portion of the entire exponent. Consider the following modification to the basic left to right exponentiation algorithm.
4867 precomputed this algorithm requires only $t$ multiplications and $tk$ squarings. The table can be generated with $2^{k - 1} - 1$ squarings and
4868 $2^{k - 1} + 1$ multiplications. This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$.
4869 However, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}.
4871 Suppose $k = 4$ and $t = 100$. This modified algorithm will require $109$ multiplications and $408$ squarings to compute the exponentiation. The
4872 original algorithm would on average have required $200$ multiplications and $400$ squarings to compute the same value. The total number of squarings
4878 for various exponent sizes and compares the number of multiplication and squarings required against algorithm~\ref{fig:LTOR}.
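A fixed-window version of the idea fits in a few lines once the exponent is word sized. This sketch uses $k = 4$ and a full table $g^0, \ldots, g^{15}$ (the hypothetical name expt\_kary is not a library routine; moduli must fit in 32 bits):

```c
#include <stdint.h>
#include <assert.h>

/* k-ary modular exponentiation sketch, k = 4: process the 32-bit exponent
 * four bits at a time, paying k squarings plus one table multiply per
 * window instead of up to one multiply per bit. */
uint32_t expt_kary(uint32_t g, uint32_t x, uint32_t p)
{
    uint64_t tab[16];
    tab[0] = 1 % p;
    for (int i = 1; i < 16; i++)
        tab[i] = tab[i - 1] * (g % p) % p;   /* tab[i] = g^i mod p */

    uint64_t r = 1 % p;
    for (int i = 28; i >= 0; i -= 4) {
        for (int s = 0; s < 4; s++)
            r = r * r % p;                   /* k squarings per window */
        r = r * tab[(x >> i) & 15] % p;      /* one multiply per window */
    }
    return (uint32_t)r;
}
```

Because leading zero windows multiply by $g^0 = 1$, this simple form does a little wasted work at the top of the exponent; the sliding-window variants discussed next avoid exactly that.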
4903 A simple modification to the previous algorithm is to generate only the upper half of the table in the range $2^{k-1} \le g < 2^k$. Essentially
4905 algorithm values of $g$ in the range $0 \le g < 2^{k-1}$ must be avoided.
4907 Table~\ref{fig:OPTK2} lists optimal values of $k$ for various exponent sizes and compares the work required against algorithm~\ref{fig:KARY}.
4956 Similar to the previous algorithm this algorithm must have a special handler when fewer than $k$ bits are left in the exponent. While this
4957 algorithm requires the same number of squarings it can potentially have fewer multiplications. The pre-computed table $a^g$ is also half
4960 Consider the exponent $b = 111101011001000_2 \equiv 31432_{10}$ with $k = 3$ using both algorithms. The first algorithm will divide the exponent up as
4961 the following five $3$-bit words $b \equiv \left ( 111, 101, 011, 001, 000 \right )_{2}$. The second algorithm will break the
4977 Before the actual modular exponentiation algorithm can be written a wrapper algorithm must be written first. This algorithm
4979 value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}). If no inverse exists the algorithm
4996 \hspace{3mm}3.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm mp\_exptmod\_fast. \\
4998 \hspace{3mm}4.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm s\_mp\_exptmod. \\
5007 The first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod. It is a sliding window $k$-ary algorithm
5008 which uses Barrett reduction to reduce the product modulo $p$. The second algorithm mp\_exptmod\_fast performs the same operation
5010 algorithm since their arguments are essentially the same (\textit{two mp\_ints and one mp\_digit}).
5020 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned
5021 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recurse with these new values and a positive
5024 If the exponent is positive the algorithm resumes the exponentiation. Line 77 determines if the modulus is of the restricted Diminished Radix
5034 Line 69 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise,
5035 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction.
5137 This algorithm computes the $x$'th power of $g$ modulo $p$ and stores the result in $y$. It takes advantage of the Barrett reduction
5138 algorithm to keep the product small throughout the algorithm.
5145 the rest of the algorithm more efficient. The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 2$
5156 \item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit
5177 algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read. Step 12.6 and 12.7-10 handle
5189 a Left-to-Right algorithm is used to process the remaining few bits.
5256 Integer division aside from modular exponentiation is the most intensive algorithm to compute. Like addition, subtraction and multiplication
5257 the basis of this algorithm is the long-hand division algorithm taught to school children. Throughout this discussion several common variables
5259 let $r$ represent the remainder $r = y - x \lfloor y / x \rfloor$. The following simple algorithm will be used to start the discussion.
5285 As children we are taught this very simple algorithm for the case of $\beta = 10$. Almost instinctively several optimizations are taught for which
5449 This algorithm will calculate the quotient and remainder of an integer division given a dividend and divisor. The algorithm is a signed
5456 divisor $y$ and dividend $x$ are made as well. The core of the division algorithm is an unsigned division and will only work if the values are
5460 At this point the division algorithm can begin producing digits of the quotient. Recall that the maximum value of the estimation used is
5476 algorithm~\ref{fig:raddiv}} and then subsequently add a multiple of the divisor if the quotient was too large.
5479 remainder. An important aspect of this algorithm seemingly overlooked in other descriptions such as that of Algorithm 14.20 HAC \cite[pp. 598]{HAC}
5491 The implementation of this algorithm differs slightly from the pseudo code presented previously. In this algorithm either of the quotient $c$ or
5493 algorithm with only the quotient is
5505 exactly what is required. For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting
5511 The conditional ``continue'' on line 187 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
5512 algorithm eliminates multiple non-zero digits in a single iteration. This ensures that $x_i$ is always non-zero since by definition the digits
5547 This algorithm initiates a temporary mp\_int with the value of the single digit and uses algorithm mp\_add to add the two values together.
5559 The single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int.
5563 multiplication algorithm. Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands
5596 This algorithm quickly multiplies an mp\_int by a small single digit value. It is specially tailored to the job and has minimal overhead.
5597 Unlike the full multiplication algorithms this algorithm does not require any significant temporary storage or memory allocations.
5610 Like the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion. Since the
5611 divisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient.
5622 2. If $b = 3$ then use algorithm mp\_div\_3 instead. \\
5646 This algorithm divides the mp\_int $a$ by the single mp\_digit $b$ using an optimized approach. Essentially in every iteration of the
5647 algorithm another digit of the dividend is reduced and another digit of quotient produced. Provided $b < \beta$ the value of $\hat w$
5650 If the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3. It replaces the division by three with
5661 Like the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to
5662 indicate the respective value is not required. This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created.
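The one-division-per-digit structure is easy to see in a word-array sketch (illustrative only; the library's mp\_div\_d also handles signs and optional outputs). The digits are stored least significant first, as in an mp\_int:

```c
#include <stdint.h>
#include <assert.h>

/* Single digit division sketch with beta = 2^32.  One pass over the
 * dividend from the most significant digit down: each step brings down a
 * digit, emits a quotient digit with one 64/32 division, and carries the
 * remainder into the next step.  a[] is replaced by the quotient and the
 * final remainder (a mod b) is returned. */
uint32_t div_d(uint32_t *a, int used, uint32_t b)
{
    uint64_t w = 0;
    for (int i = used - 1; i >= 0; i--) {
        w = (w << 32) | a[i];        /* bring down the next digit */
        a[i] = (uint32_t)(w / b);    /* next quotient digit */
        w = w % b;                   /* remainder carries down */
    }
    return (uint32_t)w;
}
```

Returning the remainder directly is what makes the trivial mp\_mod\_d-style wrapper possible: discard the quotient and keep the returned word.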
5679 simply $f'(x) = nx^{n - 1}$. Of particular importance is that this algorithm will be used over the integers, not over a more continuous domain
5681 algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$.
5720 This algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach. It is partially optimized based on the observation
5725 The initial value of the approximation is $t2 = 2$ which allows the algorithm to start with very small values and quickly converge on the
5726 root. Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$.
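The Newton iteration for integer roots can be sketched on 64-bit words. This version starts from an over-estimate rather than from $2$ (a common variant of the same idea, not the library's mp\_n\_root), which gives a monotonically decreasing sequence, plus a final correction so that $b^n \le a$ always holds:

```c
#include <stdint.h>
#include <assert.h>

/* x^e with saturation so over-estimates during the search cannot wrap. */
static uint64_t powsat(uint64_t x, int e)
{
    uint64_t r = 1;
    while (e-- > 0) {
        if (x != 0 && r > UINT64_MAX / x)
            return UINT64_MAX;       /* saturate instead of overflowing */
        r *= x;
    }
    return r;
}

/* Integer n'th root via Newton's method on f(x) = x^n - a:
 *   x_{i+1} = ((n - 1) x_i + a / x_i^{n-1}) / n
 * Starting above the true root, the sequence decreases to floor(a^{1/n}).
 * Intended for small n, e.g. 2 <= n <= 5. */
uint64_t nth_root(uint64_t a, int n)
{
    if (a < 2)
        return a;
    int bits = 64;
    while (!(a >> (bits - 1)))
        bits--;                                   /* bits = bitlength(a) */
    uint64_t x = (uint64_t)1 << (bits / n + 1);   /* above the root */
    for (;;) {
        uint64_t p = powsat(x, n - 1);
        uint64_t y = ((uint64_t)(n - 1) * x + a / p) / n;
        if (y >= x)
            break;                                /* converged */
        x = y;
    }
    while (powsat(x, n) > a)                      /* ensure b^n <= a */
        x--;
    return x;
}
```

The final while loop mirrors the book's requirement that the returned $b$ satisfy $b^n \le a$ even when the Newton estimate lands one too high.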
5738 factoring for example, can make use of random values as starting points to find factors of a composite integer. In this case the algorithm presented
5765 This algorithm produces a pseudo-random integer of $b$ digits. By ensuring that the first digit is non-zero the algorithm also guarantees that the
5846 This algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer. A minus symbol ``-'' may precede the
5847 string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input
5848 and will stop reading characters from the string when it encounters one it cannot map to a digit. This allows numbers to be embedded
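The stop-at-unmappable-character convention is simple to mirror in a word-sized sketch (illustrative; the library's mp\_read\_radix builds an mp\_int and uses its own character map):

```c
#include <stdint.h>
#include <assert.h>

/* Radix input sketch: an optional leading '-', then characters mapped
 * through a digit table; reading stops at the first character that does
 * not map to a digit below the radix (2 <= radix <= 36). */
long long read_radix(const char *s, int radix)
{
    const char *map = "0123456789abcdefghijklmnopqrstuvwxyz";
    int neg = 0;
    long long v = 0;
    if (*s == '-') { neg = 1; s++; }
    for (; *s != '\0'; s++) {
        int d = -1;
        for (int i = 0; i < radix && map[i] != '\0'; i++)
            if (map[i] == *s) { d = i; break; }
        if (d < 0)
            break;               /* unmappable character: stop reading */
        v = v * radix + d;
    }
    return neg ? -v : v;
}
```

Stopping rather than failing on an unmapped character is what lets a number be pulled out of the middle of a larger string.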
5859 Generating radix-$n$ output is fairly trivial with a division and remainder algorithm.
5893 This algorithm computes the radix-$r$ representation of an mp\_int $a$. The ``digits'' of the representation are extracted by reducing
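The repeated divide-and-remainder loop is the whole story; a word-sized sketch (not the library's mp\_toradix, which divides an mp\_int by the radix each pass) looks like this:

```c
#include <string.h>
#include <assert.h>

/* Radix output by repeated division: each a mod r is the next digit.
 * Digits come out least significant first, so the scratch buffer is
 * reversed into the caller's buffer at the end (2 <= r <= 36). */
char *to_radix(unsigned long long a, int r, char *buf)
{
    const char *map = "0123456789abcdefghijklmnopqrstuvwxyz";
    char tmp[65];
    int i = 0;
    if (a == 0)
        tmp[i++] = '0';
    while (a != 0) {
        tmp[i++] = map[a % r];    /* extract the next digit */
        a /= r;
    }
    for (int j = 0; j < i; j++)   /* reverse into the output buffer */
        buf[j] = tmp[i - 1 - j];
    buf[i] = '\0';
    return buf;
}
```

Note this is the exact inverse of the radix reader: feeding the output of one into the other round-trips the value.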
5924 symbol computation. These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and
5956 This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly. However, divisions are
5982 The algorithm in figure~\ref{fig:gcd2} will eventually terminate since, for $b \ge a$, the subtraction in step 1.2 produces a value less than $b$. In other
5984 divisible by the greatest common divisor (\textit{until the last iteration}) and in the last iteration of the algorithm $b = 0$, therefore, in the
5985 second to last iteration of the algorithm $b = a$ and clearly $(a, a) = a$ which concludes the proof. \textbf{QED}.
5987 As a matter of practicality algorithm \ref{fig:gcd1} decreases far too slowly to be useful. Especially if $b$ is much larger than $a$ such that
5988 $b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
6026 This algorithm is based on the first except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
6037 The algorithms presented so far cannot handle inputs which are zero or negative. The following algorithm can handle all input cases properly
6079 This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$. The algorithm was originally based on Algorithm B of
6085 $a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
6118 any independent factors of two such that both $u$ and $v$ are guaranteed to be odd integers before hitting the main body of the algorithm. The while loop
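The shift-and-subtract structure is clearest at single precision. A sketch of the binary greatest common divisor on 64-bit words (the mp\_int version follows the same shape, just with multi-digit shifts and subtractions):

```c
#include <stdint.h>
#include <assert.h>

/* Binary GCD: strip the common factors of two first, then repeatedly halve
 * the even member and subtract the smaller odd value from the larger.
 * Only shifts, comparisons and subtractions are used -- no division. */
uint64_t bin_gcd(uint64_t u, uint64_t v)
{
    if (u == 0) return v;
    if (v == 0) return u;
    int k = 0;
    while (((u | v) & 1) == 0) {     /* common factors of two */
        u >>= 1; v >>= 1; k++;
    }
    while ((u & 1) == 0)             /* u stays odd from here on */
        u >>= 1;
    do {
        while ((v & 1) == 0)
            v >>= 1;
        if (u > v) { uint64_t t = u; u = v; v = t; }
        v -= u;                      /* even, since both were odd */
    } while (v != 0);
    return u << k;                   /* restore the shared powers of two */
}
```

With the divisor in hand the least common multiple is immediate, $\mbox{lcm}(a, b) = (a / \gcd(a,b)) \cdot b$, which is essentially how the direct computation mentioned for mp\_lcm proceeds.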
6151 This algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$. It computes the least common multiple directly by
6217 further details.} will be used to derive an efficient Jacobi symbol algorithm. Where $p$ is an odd integer greater than two and $a, b \in \Z$ the
6257 factors of $p$ do not have to be known. Furthermore, if $(a, p) = 1$ then the algorithm will terminate when the recursion requests the
6301 This algorithm computes the Jacobi symbol for an arbitrary positive integer $a$ with respect to an odd integer $p$ greater than three. The algorithm
6302 is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.
6323 The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm
6328 bit of $k$ is required; however, performing an addition makes the algorithm simpler to follow. In practice an exclusive-or and an addition have the same
6335 Finally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$.
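The recursion unwinds into a compact loop at single precision. This sketch applies the same three facts, $(2/p) = (-1)^{(p^2-1)/8}$, multiplicativity, and quadratic reciprocity (illustrative; mp\_jacobi recurses over mp\_ints instead):

```c
#include <stdint.h>
#include <assert.h>

/* Jacobi symbol (a/p) for odd p > 0.  Factors of two are pulled out of a
 * with the (2/p) rule, then reciprocity swaps the pair, flipping the sign
 * only when both values are congruent to 3 mod 4. */
int jacobi(uint64_t a, uint64_t p)
{
    int s = 1;
    a %= p;
    while (a != 0) {
        while ((a & 1) == 0) {           /* remove factors of two */
            a >>= 1;
            uint64_t m = p & 7;
            if (m == 3 || m == 5)        /* (2/p) = -1 iff p = +-3 (mod 8) */
                s = -s;
        }
        uint64_t t = a; a = p; p = t;    /* quadratic reciprocity swap */
        if ((a & 3) == 3 && (p & 3) == 3)
            s = -s;
        a %= p;
    }
    return p == 1 ? s : 0;               /* gcd(a, p) > 1 gives 0 */
}
```

For example, modulo $7$ the squares are $\left \lbrace 1, 2, 4 \right \rbrace$, so $(2/7) = 1$ while $(5/7) = -1$.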
6364 $a$ modulo $p$. The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$.
6365 However, instead of using that algorithm directly a variant known as the binary Extended Euclidean algorithm will be used in its place. The
6366 binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
6379 2. If $b_0 \equiv 1 \mbox{ (mod }2\mbox{)}$ then use algorithm fast\_mp\_invmod. \\
6419 This algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$. This algorithm is a variation of the
6420 extended binary Euclidean algorithm from HAC \cite[pp. 608]{HAC}. It has been modified to only compute the modular inverse and not a complete
6426 The astute reader will observe that steps seven through nine are very similar to the binary greatest common divisor algorithm mp\_gcd. In this case
6427 the other variables to the Diophantine equation are solved. The algorithm terminates when $u = 0$ in which case the solution is
6433 If $v$, the greatest common divisor of $a$ and $b$ is not equal to one then the algorithm will report an error as no inverse exists. Otherwise, $C$
6450 The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed. This
6461 prime the algorithm may be incorrect.
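The odd-modulus case can be sketched on machine words. The key step, and the reason the modulus must be odd, is that an odd coefficient $x$ can still be halved modulo $m$ as $(x + m)/2$. The name invmod\_odd is illustrative, and the routine assumes $\gcd(a, m) = 1$:

```c
#include <stdint.h>
#include <assert.h>

/* Binary extended Euclidean sketch for odd m: maintains u == a*x1 (mod m)
 * and v == a*x2 (mod m), halving even values and subtracting the smaller
 * from the larger until one of u, v reaches 1.  Returns a^{-1} mod m
 * when gcd(a, m) = 1 (behaviour is undefined otherwise). */
int64_t invmod_odd(int64_t a, int64_t m)
{
    int64_t u = a % m, v = m;
    int64_t x1 = 1, x2 = 0;
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {
            u >>= 1;                       /* halve u ... */
            x1 = (x1 & 1) ? (x1 + m) >> 1 : x1 >> 1;   /* ... and x1 mod m */
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + m) >> 1 : x2 >> 1;
        }
        if (u >= v) { u -= v; x1 -= x2; if (x1 < 0) x1 += m; }
        else        { v -= u; x2 -= x1; if (x2 < 0) x2 += m; }
    }
    return u == 1 ? x1 : x2;
}
```

When $u$ reaches $1$ the invariant $u \equiv a x_1 \mbox{ (mod }m\mbox{)}$ gives $a x_1 \equiv 1$, so $x_1$ is the inverse (and symmetrically for $v$ and $x_2$).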
6473 of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove if a candidate is prime. However, often it can prove a candidate is not prime.
6507 This algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions.
6516 The algorithm defaults to a return of $0$ in case an error occurs. The values in the prime table are all specified to be in the range of a
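Stripped of mp\_int plumbing, trial division reduces to a short loop over the prime table. The name and return convention here are illustrative, not the library's mp\_prime\_is\_divisible:

```c
#include <stdint.h>
#include <assert.h>

/* Trial division sketch: returns 1 when some table prime divides n
 * (n is provably composite) and 0 when no table prime at or below
 * sqrt(n) divides it.  A prime dividing itself is not composite. */
int is_trivially_composite(uint64_t n, const uint32_t *primes, int count)
{
    for (int i = 0; i < count; i++) {
        uint64_t p = primes[i];
        if (p * p > n)
            break;               /* past sqrt(n): nothing more to learn */
        if (n % p == 0)
            return n != p;
    }
    return 0;
}
```

As the text notes, a $0$ result only means the table proved nothing; unless the table covers every prime below $\sqrt{n} + 1$, the candidate may still be composite.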
6562 This algorithm determines whether an mp\_int $a$ is a Fermat prime to the base $b$ or not. It uses a single modular exponentiation to
6574 candidate integers. The algorithm is based on the observation that if $n - 1 = 2^kr$ and if $b^r \nequiv \pm 1$ then after up to $k - 1$ squarings the
6609 This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$. It will set $c = 1$ if the algorithm cannot determine
6612 If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will
6613 square $y$ up to $s - 1$ times, stopping only when $y \equiv -1$. If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$
6614 is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably
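A single Miller-Rabin round condenses nicely at word size. This sketch (illustrative, moduli below $2^{32}$ so products fit in 64 bits; the library's mp\_prime\_miller\_rabin works on mp\_ints) returns $1$ when the test is inconclusive and $0$ when $a$ is provably composite:

```c
#include <stdint.h>
#include <assert.h>

/* Square-and-multiply helper: b^e mod n for n < 2^32. */
static uint32_t powmod(uint32_t b, uint32_t e, uint32_t n)
{
    uint64_t r = 1 % n, t = b % n;
    while (e != 0) {
        if (e & 1)
            r = r * t % n;
        t = t * t % n;
        e >>= 1;
    }
    return (uint32_t)r;
}

/* One Miller-Rabin round to base b for odd a > 2.  Write a - 1 = 2^s * r
 * with r odd; if b^r == +-1 nothing is proven, otherwise square up to
 * s - 1 times watching for -1.  Returns 1 (inconclusive) or 0 (composite). */
int miller_rabin(uint32_t a, uint32_t b)
{
    uint32_t r = a - 1;
    int s = 0;
    while ((r & 1) == 0) {        /* a - 1 = 2^s * r, r odd */
        r >>= 1;
        s++;
    }
    uint64_t y = powmod(b, r, a);
    if (y == 1 || y == a - 1)
        return 1;                 /* cannot prove a composite */
    for (int i = 0; i < s - 1; i++) {
        y = y * y % a;
        if (y == a - 1)
            return 1;             /* reached -1: inconclusive */
    }
    return 0;                     /* never reached -1: provably composite */
}
```

The base-2 strong pseudoprime $2047 = 23 \cdot 89$ shows why a single round is not a proof of primality: the round to base $2$ is inconclusive even though the input is composite.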