
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} 
90 They ask why I did it and especially why I continue to work on them for free. The best I can explain it is ``Because I can.''
92 perhaps explains it better. I am the first to admit there is not anything that special with what I have done. Perhaps
105 length (along with the source code) is a tiresome and lengthy process. Currently the LibTom project is four years old,
107 were there at the beginning to encourage me to work well. It is amazing how timely validation from others can boost morale to
121 This book is Tom's child and he has been caring and fostering the project ever since the beginning with a clear mind of
130 multiple precision calculations is often very important since we deal with outdated machine architecture where modular
133 This text is for people who stop and wonder when first examining algorithms such as RSA and ask
134 themselves, ``You tell me this is only secure for large numbers, fine; but how do you implement these numbers?''
156 friends go through the mill of publication, so I think that the way to go is longer than Tom thinks it is. Nevertheless, it's a good effort,
168 \subsection{What is Multiple Precision Arithmetic?}
171 reason that $7$ times $6$ is $42$. However, $42$ has two digits of precision as opposed to the one digit we started with.
183 The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation
205 language\footnote{As per the ISO C standard. However, each compiler vendor is allowed to augment the precision as they
206 see fit.} can only represent values up to $10^{19}$ as shown in figure \ref{fig:ISOC}. On its own the C language is
218 However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines.
219 Another auxiliary use of multiple precision integers is high precision floating point data types.
220 The basic IEEE \cite{IEEE} standard floating point type is made up of an integer mantissa $q$, an exponent $e$ and a sign bit $s$.
221 Numbers are given in the form $n = q \cdot b^e \cdot (-1)^s$ where $b = 2$ is the most common base for IEEE. Since IEEE
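That decomposition can be checked with the standard frexp/ldexp functions. This is only an illustrative sketch: frexp returns the mantissa scaled into $[0.5, 1)$ with $b = 2$, which is one valid normalization of the form above.

```c
#include <math.h>

/* Decompose a double into sign, mantissa and exponent, then rebuild it,
   illustrating n = q * 2^e * (-1)^s. */
static double rebuild(double n) {
    int e;
    double q = frexp(fabs(n), &e);   /* |n| = q * 2^e with 0.5 <= q < 1 */
    int s = (n < 0.0) ? 1 : 0;       /* sign bit */
    return ldexp(q, e) * (s ? -1.0 : 1.0);
}
```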
222 floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small
223 (\textit{23, 48 and 64 bits}). The mantissa is merely an integer and a multiple precision integer could be used to create
227 Yet another use for large integers is within arithmetic on polynomials of large characteristic (i.e. $GF(p)[x]$ for large $p$).
232 The benefit of multiple precision representations over single or fixed precision representations is that
233 no precision is lost while representing the result of an operation which requires excess precision. For example,
238 It is possible to implement algorithms which require large integers with fixed precision algorithms. For example, elliptic
239 curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum
242 processor has an 8 bit accumulator.}. However, as efficient as such an approach may be, the resulting source code is not
246 overhead can be kept to a minimum with careful planning, but overall, it is not well suited for most memory starved
248 inputs. That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input
253 The purpose of this text is to instruct the reader regarding how to implement efficient multiple precision algorithms.
254 That is to not only explain a limited subset of the core theory behind the algorithms but also the various ``house keeping''
259 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts. For
262 the fact that the two integer inputs may be of differing magnitudes. As a result the implementation is not as simple
270 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer
271 package. As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.com}} package is used
273 tested and work very well. The LibTomMath library is freely available on the Internet for all uses and this text
294 synonymous. When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members
301 to solve a given problem. When an algorithm is described as accepting an integer input it is assumed the input is
302 a plain integer with no additional multiple-precision members. That is, algorithms that use integers as opposed to
312 carry. Since all modern computers are binary, it is assumed that $q$ is two.
319 the $j$'th digit of a double precision array. Whenever an expression is to be assigned to a double precision
320 variable it is assumed that all single precision variables are promoted to double precision during the evaluation.
333 as indicated. The only exception to this rule is when variables have been indicated to be of type mp\_int. This
334 distinction is important as scalars are often used as array indices and various other counters.
340 the $/$ division symbol is used the intention is to perform an integer division with truncation. For example,
341 $5/2 = 2$ which will often be written as $\lfloor 5/2 \rfloor = 2$ for clarity. When an expression is written as a
342 fraction a real value division is implied, for example ${5 \over 2} = 2.5$.
349 To measure the efficiency of the specified algorithms, a modified big-Oh notation is used. In this system all
351 That is a single precision addition, multiplication and division are assumed to take the same time to
352 complete. While this is generally not true in practice, it will simplify the discussions considerably.
354 Some algorithms have slight advantages over others which is why some constants will not be removed in
358 in the context of this text this is not the case as the magnitude of the inputs will typically be rather small. As a
361 All of the algorithms presented in this text have a polynomial time work level. That is, of the form
369 chapters. The reader is encouraged to finish the exercises as they appear to get a better understanding of the
407 devising new theory. These problems are quick tests to see if the material is understood. Problems at the second level
411 Problems at the third level are meant to be a bit more difficult than the first two levels. The answer is often
424 Often problems will be tied together. The purpose of this is to start a chain of thought that will be discussed in future chapters. The reader
425 is encouraged to answer the follow-up problems and to consider how the problems relate to one another.
429 \subsection{What is LibTomMath?}
430 LibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C. By portable it
431 is meant that the library does not contain any code that is computer platform dependent or otherwise problematic to use on
436 as the Gameboy Advance. The library is designed to contain enough functionality to be able to develop applications such
442 even though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the
449 algorithms automatically without the developer's specific attention. One such example is the generic multiplication
453 Making LibTomMath as efficient as possible is not the only goal of the LibTomMath project. Ideally the library should
455 MPI library was used as an API template for all the basic functions. MPI was chosen because it is another library that fits
459 The project is also meant to act as a learning tool for students, the logic being that no easy-to-follow ``bignum''
464 LibTomMath was chosen as the case study of this text not only because the author of both projects is one and the same but
470 The LibTomMath code base is all portable ISO C source code. This means that there are no platform dependent conditional
475 The code base of LibTomMath is well organized. Each function is in its own separate source code file
480 When compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.}
481 which is fairly small compared to GMP (over $250$KiB). LibTomMath is slightly larger than MPI (which compiles to about
482 $50$KiB) but LibTomMath is also much faster and more complete than MPI.
485 LibTomMath is designed after the MPI library and shares the API design. Quite often programs that use MPI will build
487 functions share the same parameter passing convention. The learning curve is fairly shallow with the API provided
488 which is an extremely valuable benefit for the student and developer alike.
490 The LIP library is an example of a library with an API that is awkward to work with. LIP uses function names that are often ``compressed'' to
495 effect a math error (i.e. invalid input, heap error, etc) can cause a program to stop functioning which is definitely
499 While LibTomMath is certainly not the fastest library (GMP often beats LibTomMath by a factor of two) it does
505 LibTomMath is almost always an order of magnitude faster than the MPI library at computationally expensive tasks such as modular
506 exponentiation. In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually
519 LibTomMath is a relatively compact, well documented, highly optimized and portable library which seems only natural for
521 the reader is encouraged to download their own copy of the library to actually be able to work with the library.
525 The trick to writing any useful library of source code is to build a solid foundation and work outwards from it. First,
527 inability to accommodate multiple precision integers is the problem. Furthermore, the solution must be written
528 as portable source code that is reasonably efficient across several different computer platforms.
530 After a foundation is formed the remainder of the library can be designed and implemented in a hierarchical fashion.
531 That is, to implement the lowest level dependencies first and work towards the most abstract functions last. For example,
533 By building outwards from a base foundation instead of using a parallel design methodology the resulting project is
534 highly modular. Being highly modular is a desirable property of any project as it often means the resulting product
561 \section{What is a Multiple Precision Integer?}
563 be used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is
568 the largest single digit value is $9$. However, by concatenating digits together larger numbers may be represented. Newly prepended digits
569 (\textit{to the left}) are said to be in a different power of ten column. That is, the number $123$ can be described as having a $1$ in the hundreds
571 multiple precision arithmetic is essentially the same concept. Larger integers are represented by adjoining fixed
572 precision computer words with the exception that a different radix is used.
575 integer. For example, the integer $154_{10}$ has two immediately obvious properties. First, the integer is positive,
576 that is the sign of this particular integer is positive as opposed to negative. Second, the integer has three digits in
577 its representation. There is an additional property that the integer possesses that does not concern pencil-and-paper
578 arithmetic. The third property is how many digit placeholders are available to hold the integer.
580 The human analogy of this third property is ensuring there is enough space on the paper to write the integer. For example,
583 will not exceed the allowed boundaries. These three properties make up what is known as a multiple precision
588 The mp\_int structure is the ISO C based manifestation of what represents a multiple precision integer. The ISO C standard does not provide for
589 any such data type but it does provide for making composite data types known as structures. The following is the structure definition
624 precision integer. It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits. The array is maintained in a least
625 significant digit order. As a pencil and paper analogy the array is organized such that the right most digits are stored
636 The only exceptions are when the structure is passed to initialization functions such as mp\_init() and mp\_init\_copy().
639 \item The value of \textbf{alloc} may not be less than one. That is \textbf{dp} always points to a previously allocated
642 \item The value of \textbf{used} implies the digit at index $(used - 1)$ of the \textbf{dp} array is non-zero. That is,
647 \item The value of \textbf{sign} must be \textbf{MP\_ZPOS} if \textbf{used} is zero;
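The structural rules above can be captured in a small sketch. The field and macro names follow the discussion in the text; the actual definitions in tommath.h may differ in detail.

```c
#include <stddef.h>

#define MP_ZPOS 0   /* positive (or zero) */
#define MP_NEG  1   /* negative */

typedef unsigned int mp_digit;   /* single precision digit type (assumed) */

/* Sketch of the mp_int structure described in the text. */
typedef struct {
    int       used;   /* digits actually used */
    int       alloc;  /* digits allocated in dp; always >= 1 */
    int       sign;   /* MP_ZPOS or MP_NEG; MP_ZPOS whenever used == 0 */
    mp_digit *dp;     /* digits, least significant first; padded with zeros */
} mp_int_sketch;

/* Check the invariants listed above. */
static int mp_valid(const mp_int_sketch *a) {
    if (a->alloc < 1 || a->dp == NULL) return 0;           /* dp always allocated */
    if (a->used < 0 || a->used > a->alloc) return 0;
    if (a->used > 0 && a->dp[a->used - 1] == 0) return 0;  /* top digit non-zero */
    if (a->used == 0 && a->sign != MP_ZPOS) return 0;      /* zero is positive */
    return 1;
}
```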
664 The left to right order is a fairly natural way to implement the functions since it lets the developer read aloud the
668 of assignment expressions. That is, the destination (output) is on the left and arguments (inputs) are on the right. In
669 truth, it is entirely a matter of preference. In the case of LibTomMath the convention from the MPI library has been
672 Another very useful design consideration, provided for in LibTomMath, is whether to allow argument sources to also be a
673 destination. For example, the second example (\textit{mp\_add}) adds $a$ to $b$ and stores in $a$. This is an important
675 However, to implement this feature specific care has to be given to ensure the destination is not modified before the
676 source is fully read.
685 instance) and memory allocation errors. It will not check that the mp\_int passed to any function is valid nor
704 When an error is detected within a function it should free any memory it allocated, often during the initialization of
705 temporary mp\_ints, and return as soon as possible. The goal is to leave the system in the same state it was when the
706 function was called. Error checking with this style of API is fairly simple.
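As a sketch of that convention: the constants MP\_OKAY and MP\_MEM appear later in the text, but the helper below is purely hypothetical, simulating a function that allocates a temporary and must release it before reporting an error.

```c
#include <stdlib.h>

#define MP_OKAY 0    /* success code used throughout the text */
#define MP_MEM  -2   /* out of memory (value assumed) */

/* Hypothetical function following the convention: on any error, free the
   temporaries it allocated and return immediately, leaving the system in
   the same state it was in when the function was called. */
static int might_fail(int simulate_error) {
    void *tmp = malloc(16);               /* temporary working storage */
    if (tmp == NULL) return MP_MEM;
    if (simulate_error) {
        free(tmp);                        /* clean up before returning */
        return MP_MEM;
    }
    /* ... real work would go here ... */
    free(tmp);
    return MP_OKAY;
}
```

A caller then simply tests the return value, e.g. `if (might_fail(0) != MP_OKAY) { /* handle error */ }`.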
716 The GMP \cite{GMP} library uses C style \textit{signals} to flag errors which is of questionable use. Not all errors are fatal
720 The logical starting point when actually writing multiple precision integer functions is the initialization and
724 the integer. Often it is optimal to allocate a sufficiently large pre-set number of digits even though
726 would occur when operations are performed on the integers. There is a tradeoff between how many default digits to allocate
731 be initialized. Since the initial state of an mp\_int is to represent the zero integer, the allocated digits must be set
735 An mp\_int is said to be initialized if it is set to a valid, preferably default, state such that all of the members of the
761 The purpose of this function is to initialize an mp\_int structure so that the rest of the library can properly
762 manipulate it. It is assumed that the input may not have had any of its members previously initialized which is certainly
766 the digits is allocated. If this fails the function returns before setting any of the other members. The \textbf{MP\_PREC}
768 used to dictate the minimum precision of newly initialized mp\_int integers. Ideally, it is at least equal to the smallest
772 heap operations later functions will have to perform in the future. If \textbf{MP\_PREC} is set correctly the slack
781 when the ``to'' keyword is placed between two expressions. For example, ``for $a$ from $b$ to $c$ do'' means that
783 iteration the variable $a$ is substituted for a new integer that lies inclusively between $b$ and $c$. If $b > c$ occurred
794 One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure. It
795 is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack. The
796 call to mp\_init() is used only to initialize the members of the structure to a known default state.
798 Here we see (line 24) the memory allocation is performed first. This allows us to exit cleanly and quickly
799 if there is an error. If the allocation fails the routine will return \textbf{MP\_MEM} to the caller to indicate there
800 was a memory error. The function XMALLOC is what actually allocates the memory. Technically XMALLOC is not a function
801 but a macro defined in ``tommath.h''. By default, XMALLOC will evaluate to malloc() which is the C library's built--in
804 In order to assure the mp\_int is in a known state the digits must be set to zero. On most platforms this could have been
811 a success code is returned to the calling function. If this function returns \textbf{MP\_OKAY} it is safe to assume the
812 mp\_int structure has been properly initialized and is safe to use with other functions within the library.
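The steps described for mp\_init can be sketched as follows. This is a minimal illustration, assuming the structure and constants from earlier in the text; the real mp\_init differs in details such as how the digits are zeroed.

```c
#include <stdlib.h>

#define MP_OKAY 0
#define MP_MEM  -2
#define MP_ZPOS 0
#define MP_PREC 32   /* default digit count; the actual value is configurable */

typedef unsigned int mp_digit;
typedef struct { int used, alloc, sign; mp_digit *dp; } mp_int_sketch;

/* Sketch of mp_init: allocate MP_PREC zeroed digits first, so the function
   can fail cleanly, then put the structure into the default zero state. */
static int mp_init_sketch(mp_int_sketch *a) {
    a->dp = (mp_digit *)calloc(MP_PREC, sizeof(mp_digit));
    if (a->dp == NULL) {
        return MP_MEM;            /* fail before setting any other member */
    }
    a->used  = 0;                 /* represents the integer zero */
    a->alloc = MP_PREC;
    a->sign  = MP_ZPOS;
    return MP_OKAY;
}
```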
815 When an mp\_int is no longer required by the application, the memory that has been allocated for its digits must be
841 if a developer accidentally re-uses a cleared structure it is less likely to cause problems. The second goal
842 is to free the allocated memory.
844 The logic behind the algorithm is extended by marking cleared mp\_int structures so that subsequent calls to this
848 Once an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm
859 checks to see if the \textbf{dp} member is not \textbf{NULL}. If the mp\_int is a valid mp\_int then \textbf{dp} cannot be
863 the digits are assigned zero instead of using block memory operations (such as memset()) since this is more portable.
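The two goals of clearing, zeroing the digits and freeing the memory, plus the marking of cleared structures, might be sketched like this (names assumed, not the actual LibTomMath source):

```c
#include <stdlib.h>

#define MP_ZPOS 0
typedef unsigned int mp_digit;
typedef struct { int used, alloc, sign; mp_digit *dp; } mp_int_sketch;

/* Sketch of mp_clear: zero the digits so stale values are not left in
   memory, free them, and mark the structure so a second clear is a no-op. */
static void mp_clear_sketch(mp_int_sketch *a) {
    if (a->dp != NULL) {                 /* skip already-cleared mp_ints */
        for (int i = 0; i < a->used; i++) {
            a->dp[i] = 0;                /* forget the digits, one at a time */
        }
        free(a->dp);
        a->dp = NULL;                    /* marks the structure as cleared */
        a->used = a->alloc = 0;
        a->sign = MP_ZPOS;
    }
}
```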
884 is large enough to simply increase the \textbf{used} digit count. However, when the size of the array is too small it
892 \textbf{Output}. $a$ is expanded to accommodate $b$ digits. \\
910 It is ideal to prevent re-allocations from being performed if they are not required (step one). This is useful to
913 The requested digit count is padded up to the next multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} (steps two and three).
916 It is assumed that the reallocation (step four) leaves the lower $a.alloc$ digits of the mp\_int intact. This is much
927 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line 24) checks
928 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc}
931 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is
932 padded upwards to the second multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line 25). The XREALLOC function is used
933 to re-allocate the memory. As per the other functions XREALLOC is actually a macro which evaluates to realloc by default. The realloc
935 the re-allocation. All that is left is to clear the newly allocated digits and return.
937 Note that the re-allocation result is actually stored in a temporary pointer $tmp$. This is to allow this function to return
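The padding arithmetic described above, rounding the request up to the next multiple of MP\_PREC and then adding one extra MP\_PREC of slack, can be sketched as a small helper (the MP\_PREC value is an assumption; it is configurable in the real library):

```c
#define MP_PREC 32   /* assumed default precision */

/* Padding in the style of mp_grow: round the request up to the next
   multiple of MP_PREC, then add one extra MP_PREC of slack so future
   small growths need no reallocation. */
static int pad_digits(int b) {
    return ((b + MP_PREC - 1) / MP_PREC) * MP_PREC + MP_PREC;
}
```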
943 of input mp\_ints to a given algorithm. The purpose of algorithm mp\_init\_size is similar to mp\_init except that it
952 \textbf{Output}. $a$ is initialized to hold at least $b$ digits. \\
972 digits allocated can be controlled by the second input argument $b$. The input size is padded upwards so it is a
973 multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} digits. This padding is used to prevent trivial
976 Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero. This
977 particular algorithm is useful if it is known ahead of time the approximate size of the input. If the approximation is
987 The number of digits $b$ requested is padded (line 24) by first augmenting it to the next multiple of
989 mp\_int is placed in a default state representing the integer zero. Otherwise, the error code \textbf{MP\_MEM} will be
993 \textbf{used} count is set to zero, the \textbf{alloc} count set to the padded digit count and the \textbf{sign} flag set
995 returns successfully then it is correct to assume that the mp\_int structure is in a valid state for the remainder of the
1000 The purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single
1001 statement. It is essentially a shortcut to multiple initializations.
1008 \textbf{Output}. The array is initialized such that each mp\_int of $V_k$ is ready to use. \\
1025 (\textit{step 1.2}) all of the previously initialized variables are cleared. The goal is an ``all or nothing''
1037 ``...'' argument syntax of the C programming language. The list is terminated with a final \textbf{NULL} argument
1041 $n$ of successfully initialized mp\_int structures is maintained (line 48) such that if a failure does occur,
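The ``all or nothing'' varargs pattern can be demonstrated with stand-in init and clear functions (everything here is hypothetical scaffolding; only the NULL-terminated argument list and the unwind-on-failure logic mirror mp\_init\_multi):

```c
#include <stdarg.h>
#include <stddef.h>

/* Stand-in init that can be made to fail, to demonstrate the pattern. */
static int fail_at = -1, init_count = 0;
static int demo_init(int *x) {
    if (init_count++ == fail_at) return -1;   /* simulated MP_MEM */
    *x = 1;                                   /* "initialized" marker */
    return 0;
}
static void demo_clear(int *x) { *x = 0; }

/* All-or-nothing initialization of a NULL-terminated argument list:
   count successes, and on failure walk the list again to clear them. */
static int init_multi(int *first, ...) {
    va_list args;
    int *cur = first, n = 0, res = 0;
    va_start(args, first);
    while (cur != NULL) {
        if (demo_init(cur) != 0) { res = -1; break; }
        n++;
        cur = va_arg(args, int *);
    }
    va_end(args);
    if (res != 0) {                 /* unwind the n successful inits */
        va_start(args, first);
        cur = first;
        while (n-- > 0) { demo_clear(cur); cur = va_arg(args, int *); }
        va_end(args);
    }
    return res;
}
```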
1046 When a function anticipates a result will be $n$ digits it is simpler to assume this is true within the body of
1048 $j$ digit produces a result of at most $i + j$ digits. It is entirely possible that the result is $i + j - 1$
1053 The ideal solution is to always assume the result is $i + j$ and fix up the \textbf{used} count after the function
1054 terminates. This way a single heap operation (\textit{at most}) is required. However, if the result was not checked
1059 accumulate to the point the size of the integer would be prohibitive. As a result even though the precision is very
1060 low the representation is excessively large.
1062 The mp\_clamp algorithm is designed to solve this very problem. It will trim high-order zeros by decrementing the
1063 \textbf{used} count until a non-zero most significant digit is found. Also in this system, zero is considered to be a
1064 positive number which means that if the \textbf{used} count is decremented to zero, the sign must be set to
1085 As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at
1086 the most. For example, this will happen in cases where there is not a carry to fill the last position. Step two fixes the sign for
1087 when all of the digits are zero to ensure that the mp\_int is valid at all times.
1096 Note on line 28 how the test on the \textbf{used} count is made on the left of the \&\& operator. In the C programming
1097 language the operands of \&\& are evaluated left to right with a boolean short-circuit if any condition fails. This is
1098 important since if \textbf{used} is zero the test on the right would fetch below the array. That is obviously
1099 undesirable. The parentheses on line 31 are used to make sure the \textbf{used} count is decremented and not
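The clamping loop described here, including the short-circuited \&\& guard and the parenthesized decrement, can be sketched as:

```c
#define MP_ZPOS 0
typedef unsigned int mp_digit;
typedef struct { int used, alloc, sign; mp_digit *dp; } mp_int_sketch;

/* Sketch of mp_clamp: trim high-order zero digits, then force the sign
   of zero to positive.  The && short-circuits before dp[-1] is read. */
static void mp_clamp_sketch(mp_int_sketch *a) {
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        --(a->used);   /* parentheses: decrement used, not an expression */
    }
    if (a->used == 0) {
        a->sign = MP_ZPOS;   /* zero is always positive in this system */
    }
}
```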
1127 level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they
1173 step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is
1174 limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
1186 mp\_int structures passed to a function are one and the same. For this case it is optimal to return immediately without
1189 The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$. If $b.alloc$ is less than
1190 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 30 to 33). In order to
1197 fact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits. This optimization
1200 \textbf{Remarks.} The use of pointer aliases is an implementation methodology first introduced in this function that will
1201 be used considerably in other functions. Technically, a pointer alias is simply a shorthand alias used to lower the
1220 In this case an alias is used to access the
1221 array of digits within an mp\_int structure directly. It may seem that a pointer alias is strictly not required
1224 The first reason is that most compilers will not effectively optimize pointer arithmetic. For example, some optimizations
1226 work for GCC and not MSVC. As such it is ideal to find a common ground for as many compilers as possible. Pointer
1230 The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for''
1240 Whether this code is harder to read depends strongly on the individual. However, it is quantifiably slightly more
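The alias idiom being discussed can be shown in isolation: instead of indexing the digit array on every iteration, a local pointer walks it (a generic sketch, not the mp\_copy source itself):

```c
/* Walk a digit array through a local pointer alias rather than
   re-indexing dp[n] on every iteration. */
static unsigned int sum_digits(const unsigned int *dp, int used) {
    const unsigned int *tmp = dp;   /* local alias into the array */
    unsigned int sum = 0;
    int n;
    for (n = 0; n < used; n++) {
        sum += *tmp++;              /* read the digit, advance the alias */
    }
    return sum;
}
```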
1244 Another commonly used technique in the source routines is that certain sections of code are nested. This is used in
1255 Another common operation is to make a local temporary copy of an mp\_int argument. To initialize an mp\_int
1256 and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone. This is
1265 \textbf{Output}. $a$ is initialized to be a copy of $b$. \\
1292 Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be the algorithm used to
1322 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
1323 \textbf{sign} variable is set to \textbf{MP\_ZPOS}.
1327 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
1348 This algorithm computes the absolute value of an mp\_int input. First it copies $a$ over $b$. This is an example of an
1364 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute
1402 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}. If the mp\_int is zero
1403 then the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}.
1407 Often an mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
1429 This algorithm sets an mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The
1430 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly.
1440 small positive constant. mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count
1441 is zero. Next we set the digit and reduce it modulo $\beta$ (line 22). After this step we have to
1442 check if the resulting digit is zero or not. If it is not then we set the \textbf{used} count to one, otherwise
1445 We can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with
1448 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses
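Because $\beta$ has the form $2^k$, the reduction modulo $\beta$ mentioned above is a single binary AND. A sketch (the DIGIT\_BIT value of 28 is taken from the example later in the text; the real value depends on the build):

```c
#define DIGIT_BIT 28                           /* assumed: beta = 2^28 */
#define MP_MASK   ((1UL << DIGIT_BIT) - 1UL)   /* beta - 1 */

/* Reduce modulo beta with a binary AND, valid since beta = 2^k. */
static unsigned long mod_beta(unsigned long x) {
    return x & MP_MASK;
}
```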
1452 To overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal. It accepts a ``long''
1478 next four bits from the source are extracted and are added to the mp\_int. The \textbf{used} digit count is
1479 incremented to reflect the addition. It must be incremented since if any of the leading digits were zero the mp\_int would have
1493 seem obvious as to why the digit counter does not grow exceedingly large, it is because of the shift on line 28
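The four-bits-at-a-time idea can be demonstrated on a plain accumulator: shift the result left by four, then add the most significant remaining nibble of the source. This sketch only models the digit feeding, not the mp\_int bookkeeping:

```c
/* Consume a 32-bit value four bits at a time, most significant nibble
   first, shifting the accumulated result left before each addition. */
static unsigned long set_int_sketch(unsigned long b) {
    unsigned long acc = 0;
    int i;
    for (i = 0; i < 8; i++) {                    /* 8 nibbles in 32 bits */
        acc = (acc << 4) | ((b >> 28) & 0x0FUL); /* take the top nibble */
        b <<= 4;                                 /* expose the next one */
    }
    return acc;   /* equals the original value for 32-bit inputs */
}
```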
1499 Comparing a multiple precision integer is performed with the exact same algorithm used to compare two decimal numbers. For example,
1500 to compare $1,234$ to $1,264$ the digits are extracted by their positions. That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$
1502 positions. If any leading digit of one integer is greater than a digit in the same position of another integer then obviously it must be greater.
1504 The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two
1505 mp\_int variables alone. It will ignore the sign of the two inputs. Such a function is useful when an absolute comparison is required or if the
1543 By saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return
1545 Obviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is.
1549 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
1559 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save
1560 considerable time. The implementation given is also not valid without those two statements. $b.alloc$ may be
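The magnitude comparison described here, digit counts first, then digits from the most significant position downward, can be sketched on raw digit arrays (least significant digit first, as mp\_int stores them; this is not the LibTomMath source):

```c
#define MP_LT -1
#define MP_EQ  0
#define MP_GT  1

/* Unsigned magnitude compare: a larger digit count wins outright,
   otherwise compare digits from the most significant down to the
   zero'th.  Arrays are least-significant-digit first. */
static int cmp_mag(const unsigned int *a, int aused,
                   const unsigned int *b, int bused) {
    int i;
    if (aused > bused) return MP_GT;   /* cheap tests first */
    if (aused < bused) return MP_LT;
    for (i = aused - 1; i >= 0; i--) {
        if (a[i] > b[i]) return MP_GT;
        if (a[i] < b[i]) return MP_LT;
    }
    return MP_EQ;
}
```

For the text's example, $1234$ versus $1264$: the digit arrays differ first at the tens position, so the compare returns MP\_LT.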
1566 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude
1602 has the positive sign is larger. The inputs are compared (line 32) based on magnitudes. If the signs were both
1603 negative then the unsigned comparison is performed in the opposite direction (line 34). Otherwise, the signs are assumed to
1604 be both positive and a forward direction unsigned comparison is performed.
1611 & of two random digits (of equal magnitude) before a difference is found. \\
1622 algorithms make use of the lower level algorithms and are the crucial building block for the multiplication algorithms. It is very important
1627 logical shifts respectively. A logical shift is analogous to sliding the decimal point of radix-10 representations. For example, the real
1628 number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal two places to the right (\textit{multiplying by $\beta^2 = 10^2$}).
1629 Algebraically a binary logical shift is equivalent to a division or multiplication by a power of two.
1632 One significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed
1634 result is $110_2$.
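In C this removal of bits below the zero'th position is exactly the behavior of the shift operators on unsigned values: $1101_2$ ($13$) shifted right once is $110_2$ ($6$), the fractional half being discarded.

```c
/* Logical shifts as division/multiplication by powers of two; a right
   shift discards the bits below the zero'th position. */
static unsigned int shr(unsigned int x, int k) { return x >> k; }
static unsigned int shl(unsigned int x, int k) { return x << k; }
```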
1638 $a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$ since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$.
1641 However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the
1645 The lower level algorithms will add or subtract integers without regard to the sign flag. That is they will add or subtract the magnitude of
1649 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the
1650 trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have a ``s\_'' prefix.
1697 This algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have different magnitudes.
1699 MIX pseudo machine code presented by Knuth \cite[pp. 266-267]{TAOCPV2} is incapable of handling inputs which are of different magnitudes.
1701 The first thing that has to be accomplished is to sort out which of the two inputs is the largest. The addition logic
1707 same number of digits. After the inputs are sorted the destination $c$ is grown as required to accommodate the sum
1708 of the two inputs. The original \textbf{used} count of $c$ is copied and set to the new used count.
1711 variable $\mu$ is set to zero outside the loop. Inside the loop an ``addition'' step requires three statements to produce
1713 two digits from $a$ and $b$ are added together along with the carry $\mu$. The carry of this step is extracted and stored
1714 in $\mu$ and finally the digit of the result $c_n$ is truncated within the range $0 \le c_n < \beta$.
1716 Now all of the digit positions that both inputs have in common have been exhausted. If $min \ne max$ then $x$ is an alias
1717 for one of the inputs that has more digits. A simplified addition loop is then used to essentially copy the remaining digits
1720 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition.
1731 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias. Next we
1738 The initial carry $u$ will be cleared (line 65), note that $u$ is of type mp\_digit which ensures type
1741 (line 81 to 90) adds the remaining digits from the larger of the two inputs. The addition is finished
1743 After line 94, $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful
1747 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the
1748 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
1749 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
1753 For this algorithm a new variable is required to make the description simpler. Recall from section 1.3.1 that an mp\_digit must be able to represent
1754 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that an mp\_digit represent a larger range of values. For
1758 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long''
1797 This algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive. That is when
1799 algorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well. As was the case
1802 The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have at least the same magnitude of $b$. Steps 1 and 2
1803 set the $min$ and $max$ variables. Unlike the addition routine there is guaranteed to be no carry which means that the final result can be at
1804 most $max$ digits in length as opposed to $max + 1$. Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and
1807 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision
1808 subtraction is used instead. Note the use of the $\gamma$ variable to extract the carry (\textit{also known as the borrow}) within the subtraction
1809 loops. Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry.
1813 third bit of $0101_2$ is subtracted from the result it will cause another carry. In this case though the carry will be forced to propagate all the
1818 is needed is a single zero or one bit for the carry. Therefore a single logical shift right by $\gamma - 1$ positions is sufficient to extract the
1819 carry. This method of carry extraction may seem awkward but the reason for it becomes apparent when the implementation is discussed.
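The shift-based borrow extraction can be sketched like this. The model assumes $\beta = 2^{28}$ digits in \texttt{uint32\_t} (so $\gamma = 32$), the precondition $\vert a \vert \ge \vert b \vert$, and uses illustrative names rather than LibTomMath's actual \texttt{s\_mp\_sub}.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of s_mp_sub-style borrow extraction, assuming beta = 2^28 digits
 * in uint32_t (gamma = 32 bits) and |a| >= |b|. Illustrative names only. */
#define DIGIT_BITS 28
#define MP_MASK ((1u << DIGIT_BITS) - 1u)

static int raw_sub(const uint32_t *a, int an, const uint32_t *b, int bn,
                   uint32_t *c) {
    uint32_t mu = 0;             /* the carry, i.e. the borrow */
    int i;
    for (i = 0; i < bn; i++) {
        uint32_t t = a[i] - b[i] - mu;
        mu = t >> 31;            /* two's complement: the top bit after the
                                    shift by gamma - 1 positions is the borrow */
        c[i] = t & MP_MASK;
    }
    for (; i < an; i++) {        /* propagate the borrow through the rest of a */
        uint32_t t = a[i] - mu;
        mu = t >> 31;
        c[i] = t & MP_MASK;
    }
    while (an > 0 && c[an - 1] == 0) an--;  /* clamp excess digits */
    return an;
}
```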
1831 Like low level addition we ``sort'' the inputs. Except in this case the sorting is hardcoded
1833 used to make the source code easier to read. Again the pointer alias optimization is used
1838 the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward''
1840 by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of
1842 extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the
1844 optimization only works on two's complement machines which is a safe assumption to make.
1846 If $a$ has a larger magnitude than $b$ an additional loop (lines 64 through 73) is required to propagate
1855 flag. A high level addition is actually performed as a series of eight separate cases which can be optimized down to three unique cases.
1882 This algorithm performs the signed addition of two mp\_int variables. There is no reference algorithm to draw upon from
1883 either \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations. The algorithm is fairly
1913 Figure~\ref{fig:AddChart} lists all of the eight possible input combinations and is sorted to show that only three
1918 Also note how the \textbf{sign} is set before the unsigned addition or subtraction is performed. Recall from the descriptions of algorithms
1919 s\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits. The mp\_clamp algorithm will set the \textbf{sign}
1922 For example, consider performing $-a + a$ with algorithm mp\_add. By the description of the algorithm the sign is set to \textbf{MP\_NEG} which would
1923 produce a result of $-0$. However, since the sign is set first and the unsigned addition performed afterwards, the subsequent usage of algorithm mp\_clamp
1933 The source code follows the algorithm fairly closely. The most notable new source code addition is the usage of the $res$ integer variable which
1934 is used to pass the result of the unsigned operations forward. Unlike in the algorithm, the variable $res$ is merely returned as is without
1935 explicitly checking it and returning the constant \textbf{MP\_OKAY}. The observation is this algorithm will succeed or fail only if the lower
1936 level functions do so. Returning their return code is sufficient.
1939 The high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm.
1969 This algorithm performs the signed subtraction of two inputs. Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or
1970 \cite{HAC}. Also this algorithm is restricted by algorithm s\_mp\_sub. Chart \ref{fig:SubChart} lists the eight possible inputs and
1996 Similar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction. That is to prevent the
2006 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
2007 and forward it to the end of the function. On line 39 the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a
2011 It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$.
2014 In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established. That is to shift
2015 the digits left or right as well to shift individual bits of the digits left and right. It is important to note that not all ``shift'' operations
2020 In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms, it is also a fairly efficient
2021 operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two.
2055 This algorithm will quickly multiply a mp\_int by two provided $\beta$ is a power of two. Neither \cite{TAOCPV2} nor \cite{HAC} describe such
2056 an algorithm despite the fact it arises often in other algorithms. The algorithm is setup much like the lower level algorithm s\_mp\_add since
2057 it is for all intents and purposes equivalent to the operation $b = \vert a \vert + \vert a \vert$.
2060 is set to $a.used$ at step 4. Only if there is a final carry will the \textbf{used} count require adjustment.
2062 Step 6 is an optimized implementation of the addition loop for this specific case. That is, since the two values being added together
2063 are the same there is no need to perform two reads from the digits of $a$. Step 6.1 performs a single precision shift on the current digit $a_n$ to
2065 the previous carry. Recall from section 4.1 that $a_n << 1$ is equivalent to $a_n \cdot 2$. An iteration of the addition loop is finished with
2078 This implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input. The only noteworthy difference
2079 is the use of the logical shift operator on line 52 to perform a single precision doubling.
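A doubling loop in this style might be sketched as follows, assuming $\beta = 2^{28}$ digits in a \texttt{uint32\_t} array with room for one extra digit; the names are illustrative, not the library's \texttt{mp\_mul\_2}.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of mp_mul_2-style doubling, assuming beta = 2^28 digits in a
 * uint32_t array with room for one extra digit. Illustrative only. */
#define DIGIT_BITS 28
#define MP_MASK ((1u << DIGIT_BITS) - 1u)

static int mul_2(uint32_t *a, int used) {
    uint32_t r = 0;                              /* carry between digits */
    for (int i = 0; i < used; i++) {
        uint32_t rr = a[i] >> (DIGIT_BITS - 1);  /* bit shifted out the top */
        a[i] = ((a[i] << 1) | r) & MP_MASK;      /* a_n << 1 equals a_n * 2 */
        r = rr;
    }
    if (r != 0) a[used++] = 1;   /* only a final carry adjusts used */
    return used;
}
```

Note that only one read per digit of $a$ is required, which is the optimization over a generic addition of $a$ to itself.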
2120 Essentially the loop at step 6 is similar to that of mp\_mul\_2 except the logical shifts go in the opposite direction and the carry is at the
2131 Recall from section 4.3 that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$. Such a representation is also known as
2136 Converting from an array of digits to polynomial basis is very simple. Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that
2137 $y = \sum_{i=0}^{2} a_i \beta^i$. Simply replace $\beta$ with $x$ and the expression is in polynomial basis. For example, $f(x) = 8x + 9$ is the
2138 polynomial basis representation for $89$ using radix ten. That is, $f(10) = 8(10) + 9 = 89$.
2143 degree. In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + ... + a_0 x$. From a scalar basis point of view multiplying by $x$ is equivalent to
2175 This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs
2177 motivation behind this change is due to the way this function is typically used. Algorithms such as mp\_add store the result in an optionally
2178 different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is
2179 typically used on values where the original value is no longer required. The algorithm will return success immediately if
2180 $b \le 0$ since the rest of algorithm is only valid when $b > 0$.
2182 First the destination $a$ is grown as required to accommodate the result. The counters $i$ and $j$ are used to form a \textit{sliding window} over
2183 the digits of $a$ of length $b$. The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}).
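The sliding-window copy can be sketched as follows (an illustrative model; the actual mp\_lshd also grows the mp\_int first and works on the mp\_int structure):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the mp_lshd sliding window, assuming the array already has room
 * for used + b digits. Illustrative names only. */
static int lshd(uint32_t *a, int used, int b) {
    if (b <= 0) return used;              /* negative shifts not interpreted */
    used += b;                            /* increment used before the loop */
    for (int i = used - 1; i >= b; i--)   /* head at i, tail at i - b */
        a[i] = a[i - b];                  /* copy from the tail to the head */
    for (int i = 0; i < b; i++)
        a[i] = 0;                         /* zero the vacated lower digits */
    return used;
}
```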
2184 The loop on step 7 copies the digit from the tail to the head. In each iteration the window is moved down one digit. The last loop on
2203 The if statement (line 24) ensures that the $b$ variable is greater than zero since we do not interpret negative
2204 shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This eliminates
2205 the need for an additional variable in the for loop. The variable $top$ (line 42) is an alias
2206 for the leading digit while $bottom$ (line 45) is an alias for the trailing edge. The aliases form a
2211 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit.
2243 This algorithm divides the input in place by the $b$'th power of $x$. It is analogous to dividing by $\beta^b$ but much quicker since
2246 If the input $b$ is less than one the algorithm quickly returns without performing any work. If the \textbf{used} count is less than or equal
2249 After the trivial cases of inputs have been handled the sliding window is setup. Much like the case of algorithm mp\_lshd a sliding window that
2250 is $b$ digits wide is used to copy the digits. Unlike mp\_lshd the window slides in the opposite direction from the trailing to the leading digit.
2253 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented.
2262 The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we
2264 the upper digits of the input to make sure the result is correct.
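The opposite-direction window can be sketched like this (again an illustrative model rather than the library's mp\_rshd; note the routine cannot fail, hence no return code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the mp_rshd window sliding from the trailing to the leading
 * digit. Illustrative names only. */
static int rshd(uint32_t *a, int used, int b) {
    int i;
    if (b <= 0) return used;             /* trivial case: nothing to do */
    if (used <= b) {                     /* every digit is shifted out */
        for (i = 0; i < used; i++) a[i] = 0;
        return 0;
    }
    for (i = 0; i < used - b; i++)       /* slide the window downwards */
        a[i] = a[i + b];
    for (; i < used; i++)                /* zero the upper digits */
        a[i] = 0;
    return used - b;                     /* decremented used count */
}
```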
2270 shifts $k$ times to achieve a multiplication by $2^{\pm k}$ a mixture of whole digit shifting and partial digit shifting is employed.
2311 First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which will ensure that the remainder multiplicand is less than
2316 required. If it is non-zero a modified shift loop is used to calculate the remaining product.
2317 Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le x < lg(\beta)$. The $mask$
2318 variable is used to extract the upper $d$ bits to form the carry for the next iteration.
2320 This algorithm is loosely measured as an $O(2n)$ algorithm which means that if the input is $n$ digits then it takes $2n$ ``time'' to
2321 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
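The two phases, whole-digit shifting followed by the modified mp\_mul\_2 loop with a mask, can be sketched together. The model assumes $\beta = 2^{28}$ and enough room in the array; the names are illustrative, not LibTomMath's mp\_mul\_2d.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of mp_mul_2d: whole-digit shifting followed by a partial-bit
 * shift, assuming beta = 2^28. Illustrative names only. */
#define DIGIT_BITS 28
#define MP_MASK ((1u << DIGIT_BITS) - 1u)

static int mul_2d(uint32_t *a, int used, int b) {
    if (b >= DIGIT_BITS) {                   /* multiply by x^(b / lg(beta)) */
        int shift = b / DIGIT_BITS, i;
        for (i = used + shift - 1; i >= shift; i--) a[i] = a[i - shift];
        for (i = 0; i < shift; i++) a[i] = 0;
        used += shift;
        b %= DIGIT_BITS;                     /* remaining shift < lg(beta) */
    }
    if (b > 0) {                             /* generic mp_mul_2-style loop */
        uint32_t mask = (1u << b) - 1u, r = 0;
        for (int i = 0; i < used; i++) {
            uint32_t rr = (a[i] >> (DIGIT_BITS - b)) & mask; /* upper b bits
                                                   form the next carry */
            a[i] = ((a[i] << b) | r) & MP_MASK;
            r = rr;
        }
        if (r != 0) a[used++] = r;
    }
    return used;
}
```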
2330 The shifting is performed in-place which means the first step (line 25) is to copy the input to the
2334 If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples
2376 This algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder. The algorithm is designed much like algorithm
2387 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally
2388 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the
2390 the quotient is obtained.
2392 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is
2397 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This
2398 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
2430 This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$. First if $b$ is less than or equal to zero the
2431 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$
2432 is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count.
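These steps, remove leading digits and mask the remaining leading digit, can be sketched as follows. The model computes the result in place for brevity and assumes $\beta = 2^{28}$; the names are illustrative, not the library's mp\_mod\_2d.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of mp_mod_2d: keep the low b bits by removing leading digits and
 * masking the remaining leading digit. beta = 2^28; computed in place
 * rather than into a separate c. Illustrative names only. */
#define DIGIT_BITS 28

static int mod_2d(uint32_t *a, int used, int b) {
    int i;
    if (b <= 0) {                             /* result is simply zero */
        for (i = 0; i < used; i++) a[i] = 0;
        return 0;
    }
    int digs = b / DIGIT_BITS;                /* whole digits kept intact */
    if (digs >= used) return used;            /* 2^b exceeds a: unchanged */
    a[digs] &= (1u << (b % DIGIT_BITS)) - 1u; /* a AND (2^b - 1), top digit */
    for (i = digs + 1; i < used; i++) a[i] = 0; /* remove leading digits */
    while (used > 0 && a[used - 1] == 0) used--; /* clamp excess digits */
    return used;
}
```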
2441 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases. Next if $2^b$ is larger
2463 & any $n$-bit input. Note that the time of addition is ignored in the \\
2482 where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication
2485 The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular
2486 exponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$, $c$ and $d$. During a modular
2488 35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision
2492 against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ the
2493 overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. First Karatsuba multiplication was discovered in
2502 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision
2506 The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be
2508 facet of this algorithm is that it has been modified to only produce a certain amount of output digits as resolution. The importance of this
2558 algorithm. The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}.
2562 The first thing this algorithm checks for is whether a Comba multiplier can be used instead. If the minimum digit count of either
2563 input is less than $\delta$, then the Comba method may be used instead. After the Comba method is ruled out, the baseline algorithm begins. A
2564 temporary mp\_int variable $t$ is used to hold the intermediate result of the product. This allows the algorithm to be used to
2567 All of step 5 is the infamous $O(n^2)$ multiplication loop slightly modified to only produce up to $digs$ digits of output. The $pb$ variable
2568 is given the count of digits to read from $b$ inside the nested loop. If $pb \le 1$ then no more output digits can be produced and the algorithm
2569 will exit the loop. The best way to think of the loops is as a series of $pb \times 1$ multiplications. That is, in each pass of the
2570 innermost loop $a_{ix}$ is multiplied against $b$ and the result is added (\textit{with an appropriate shift}) to $t$.
2572 For example, consider multiplying $576$ by $241$. That is equivalent to computing $10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576)$ which is best
2590 Each row of the product is added to the result after being shifted to the left (\textit{multiplied by a power of the radix}) by the appropriate
2591 count. That is, in pass $ix$ of the inner loop the product is added starting at the $ix$'th digit of the result.
2594 is assumed to be a double wide output single precision multiplication. That is, two single precision variables are multiplied to produce a
2595 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step
2596 5.4.1 is propagated through the nested loop. If the carry was not propagated immediately it would overflow the single precision digit
2599 At step 5.5 the nested loop is finished and any carry that was left over should be forwarded. The carry does not have to be added to the $ix+pb$'th
2600 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result
2611 Using the Comba routine requires that min$(a.used, b.used) < \delta$ and that the number of digits of output is less than
2612 \textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is
2613 set to $\delta$ but can be reduced when memory is at a premium.
2620 digits as output. In each iteration of the outer loop the $pb$ variable is set (line 49) to the maximum
2624 carry from the previous iteration. A particularly important observation is that most modern optimizing
2625 C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that
2626 is required for the product. In x86 terms for example, this means using the MUL instruction.
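The nested multiply-and-carry loop can be sketched as follows. This toy version assumes $\beta = 2^{28}$ digits in \texttt{uint32\_t} arrays with a 64-bit accumulator standing in for the $N \times N \rightarrow 2N$ multiplication; it omits the $digs$-resolution limit, and \texttt{raw\_mul} is an illustrative name, not s\_mp\_mul\_digs.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the baseline O(n^2) loop: each row multiplies one digit of a
 * against all of b, propagating the carry through a double-wide (64-bit)
 * accumulator. beta = 2^28; illustrative names only. */
#define DIGIT_BITS 28
#define MP_MASK ((1u << DIGIT_BITS) - 1u)

static void raw_mul(const uint32_t *a, int an, const uint32_t *b, int bn,
                    uint32_t *c) {
    int i;
    for (i = 0; i < an + bn; i++) c[i] = 0;
    for (int ix = 0; ix < an; ix++) {
        uint64_t u = 0;                               /* carry for this row */
        for (int iy = 0; iy < bn; iy++) {
            /* double-wide single precision multiply plus existing content */
            uint64_t r = (uint64_t)c[ix + iy] + (uint64_t)a[ix] * b[iy] + u;
            c[ix + iy] = (uint32_t)(r & MP_MASK);
            u = r >> DIGIT_BITS;          /* propagate the carry immediately */
        }
        c[ix + bn] = (uint32_t)u;         /* leftover carry: that digit is
                                             still zero at this point */
    }
}
```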
2628 Each digit of the product is stored in turn (line 69) and the carry propagated (line 72) to the
2633 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be
2635 in parallel. The ``Comba'' \cite{COMBA} method is named after the little-known (\textit{in cryptographic venues}) Paul G.
2640 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight
2641 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products
2645 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at
2646 the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated
2647 after the nested loop to reduce the amount of work required. Succinctly, the first step of the algorithm is to compute
2654 Where $\vec x_n$ is the $n'th$ column of the output vector. Consider the following example which computes the vector $\vec x$ for the multiplication
2674 At this point the vector $x = \left < 10, 34, 45, 31, 6 \right >$ is the result of the first step of the Comba multiplier.
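The two passes, raw column products followed by the carry fixup, can be sketched together in a toy base-10 version mirroring the example above. The constant \texttt{K} and the names are illustrative; the real routine works in $\beta = 2^{28}$, not 10.

```c
#include <assert.h>

/* Toy base-10 Comba: pass one computes each column independently with no
 * carries; pass two propagates the carries upwards. Illustrative only. */
#define K 3                               /* digits per input */

static void comba10(const int *a, const int *b, int *out /* 2K digits */) {
    long w[2 * K] = {0};
    for (int i = 0; i < K; i++)           /* first step: raw column sums */
        for (int j = 0; j < K; j++)
            w[i + j] += (long)a[i] * b[j];
    long carry = 0;                       /* second step: fix the columns */
    for (int n = 0; n < 2 * K; n++) {
        long t = w[n] + carry;
        out[n] = (int)(t % 10);
        carry = t / 10;
    }
}
```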
2675 Now the columns must be fixed by propagating the carry upwards. The resultant vector will have one extra dimension over the input vector which is
2697 With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$. In this case
2698 $241 \cdot 576$ is in fact $138816$ and the procedure succeeded. If the algorithm is correct and as will be demonstrated shortly more
2703 independently. A serious obstacle is if the carry is lost due to lack of precision before the algorithm has a chance to fix
2705 three single precision multiplications. If the precision of the accumulator for the output digits is less than $3 \cdot (\beta - 1)^2$ then
2706 an overflow can occur and the carry information will be lost. For any $m$ and $n$ digit inputs the maximum weight of any column is
2707 min$(m, n)$ which is fairly obvious.
2709 The maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used. Recall
2723 Let $\rho = lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is
2730 The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 2^{64}$ which means that $k$ is bounded by $k < 257$. In this configuration
2731 the smaller input may not have more than $256$ digits if the Comba method is to be used. This is quite satisfactory for most applications since
2732 $256$ digits would allow for numbers in the range of $0 \le x < 2^{7168}$ which is much larger than most public key cryptographic algorithms require.
2778 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the
2782 The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than
2783 $b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the distance past $b.used$ the variable
2784 $ix$ is. This is used for the immediately subsequent statement where we find $iy$.
2786 The variable $iy$ is the minimum digits we can read from either $a$ or $b$ before running out. Computing one column at a time
2789 move $ty$ downwards so the equality remains valid. The $iy$ variable is the number of iterations until
2796 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require
2798 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
2812 The inner loop (lines 71 to 74) of this implementation is where the tradeoffs come into play. Originally this Comba
2816 is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded CPUs where cache is often
2826 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
2828 The product $a \cdot b \equiv f(x)g(x)$ is the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$. The coefficients $w_i$ will
2829 directly yield the desired product when $\beta$ is substituted for $x$. The direct solution to solve for the $2n + 1$ coefficients
2834 Gaussian elimination. This technique is also occasionally referred to as the \textit{interpolation technique} (\textit{references please...}) since in
2843 is simply the product $W(0) = w_0 = a_0 \cdot b_0$. The $\zeta_1$ term is the product
2844 $W(1) = \left (\sum_{i = 0}^{n} a_i \right ) \left (\sum_{i = 0}^{n} b_i \right )$. The third point $\zeta_{\infty}$ is less obvious but rather
2845 simple to explain. The $2n + 1$'th coefficient of $W(x)$ is numerically equivalent to the most significant column in an integer multiplication.
2846 The point at $\infty$ is used symbolically to represent the most significant column, that is $W(\infty) = w_{2n} = a_nb_n$. Note that the
2860 polynomial $f(2^q)$ is equal to $2^q((2^qa_2) + a_1) + a_0$. This technique of polynomial representation is known as Horner's method.
2862 As a general rule of the algorithm when the inputs are split into $n$ parts each there are $2n - 1$ multiplications. Each multiplication is of
2863 multiplicands that have $n$ times fewer digits than the inputs. The asymptotic running time of this algorithm is
2871 \hline $2$ & $1.584962501$ & This is Karatsuba Multiplication. \\
2872 \hline $3$ & $1.464973520$ & This is Toom-Cook Multiplication. \\
2886 At first it may seem like a good idea to choose $n = 1000$ since the exponent is approximately $1.1$. However, the overhead
2903 on the AMD Athlon the ratio is roughly $17 : 1$ while on the Intel P4 it is $29 : 1$. The higher the ratio in favour of multiplication the lower
2906 \item How complex the linear system of equations (\textit{for the coefficients of $W(x)$}) is. Generally speaking, as the number of splits
2910 \item To a lesser extent memory bandwidth and function call overheads. Provided the values are in the processor cache this is less of an
2915 A clean cutoff point separation occurs when a point $y$ is found such that all of the cutoff point conditions are met. For example, if the point
2916 is too low then there will be values of $m$ such that $m > y$ and the Comba method is still faster. Finding the cutoff points is fairly simple when
2917 a high resolution timer is available.
2922 light algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent.
2929 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns
2930 out what Karatsuba did not know or at least did not publish was that this is simply polynomial basis multiplication with the points
2942 of this system of equations has made Karatsuba fairly popular. In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
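The identity behind this system can be sketched with a single split at 16 bits: three half-width multiplications replace four. This is a toy demonstration of the technique, not LibTomMath's recursive mp\_karatsuba\_mul; the name \texttt{karatsuba32} is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Toy Karatsuba with one split: x = x1*2^16 + x0, y = y1*2^16 + y0.
 * Three half-width multiplications instead of four. Illustrative only. */
static uint64_t karatsuba32(uint32_t x, uint32_t y) {
    uint32_t x1 = x >> 16, x0 = x & 0xFFFFu;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFFu;
    uint64_t z0 = (uint64_t)x0 * y0;                    /* low product  */
    uint64_t z2 = (uint64_t)x1 * y1;                    /* high product */
    /* middle term via one multiplication: (x1 + x0)(y1 + y0) - z2 - z0 */
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;
    return (z2 << 32) + (z1 << 16) + z0;
}
```

Applied recursively to multi-digit halves, this single-split step is what yields the $O(n^{lg(3)})$ work factor.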
2989 This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description
2995 smallest input \textbf{used} count. After the radix point is chosen the inputs are split into lower and upper halves. Steps 4 and 5
2999 $x0 \cdot y0$ and $x1 \cdot y1$. The mp\_int $x0$ is used as a temporary variable after $x1 + x0$ has been computed. By using $x0$ instead
3011 The new coding element in this routine, not seen in previous routines, is the usage of goto statements. The conventional
3012 wisdom is that goto statements should be avoided. This is generally true, however when every single function call can fail, it makes sense
3021 The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp\_mod\_2d and mp\_rshd
3024 is simpler to calculate both lower halves in a single loop. The for loop on lines 102 and 107 calculate the upper halves $x1$ and
3029 When line 151 is reached, the algorithm has completed successfully. The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that
3033 Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are
3034 chosen such that $\zeta$ is easy to compute and the resulting system of equations easy to reduce. Here, the points $\zeta_{0}$,
3038 With the five relations that Toom-Cook specifies, the following system of equations is formed.
3125 description, several statements have been compounded to save space. The intention is that the statements are executed from left to right across
3140 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer
3141 result $a \cdot b$ is produced.
3150 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very
3152 Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this
3153 algorithm is not practical as Karatsuba has a much lower cutoff point.
3162 After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively
3166 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all
3202 available when the input is of appropriate size. The \textbf{sign} of the result is not set until the end of the algorithm since algorithm
3212 The implementation is rather simplistic and is not particularly noteworthy. Line 22 computes the sign of the result using the ``?''
3213 operator from the C programming language. Line 48 computes $\delta$ using the fact that $1 << k$ is equal to $2^k$.
3218 Squaring is a special case of multiplication where both multiplicands are equal. At first it may seem like there is no significant optimization
3219 available but in fact there is. Consider the multiplication of $576$ against $241$. In total there will be nine single precision multiplications
3242 represent the number being squared. The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it.
3244 The second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product. Every non-square term of a column will
3245 appear twice hence the name ``double product''. Every odd column is made up entirely of double products. In fact every column is made up of double
3248 The third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row,
3249 occurs at column $2k + 1$. For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero.
3250 Column two of row one is a square and column three is the first unique column.
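The three observations can be sketched in a toy base-10 squaring routine: square terms $a_k^2$ land in column $2k$, and each off-diagonal product $a_i \cdot a_j$ with $i < j$ is added doubled. The constant \texttt{N} and the names are illustrative; the real routine works in $\beta = 2^{28}$.

```c
#include <assert.h>

/* Toy base-10 squaring built on the three observations. Illustrative. */
#define N 3

static void sqr10(const int *a, int *out /* 2N digits */) {
    long w[2 * N] = {0};
    for (int k = 0; k < N; k++) {
        w[2 * k] += (long)a[k] * a[k];        /* observation one: the square */
        for (int j = k + 1; j < N; j++)       /* unique terms start at 2k+1 */
            w[k + j] += 2L * a[k] * a[j];     /* observation two: doubled */
    }
    long carry = 0;                           /* Comba-style carry fixup */
    for (int n = 0; n < 2 * N; n++) {
        long t = w[n] + carry;
        out[n] = (int)(t % 10);
        carry = t / 10;
    }
}
```

The inner loop starting at $j = k + 1$ rather than $j = 0$ is what yields the savings over a plain multiplication.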
3253 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
3295 This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC
3296 \cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the
3299 The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
3304 very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
3305 when it is multiplied by two, it can be properly represented by a mp\_word.
3307 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
3317 Inside the outer loop (line 34) the square term is calculated on line 37. The carry (line 44) has been
3319 (lines 47 and 50) to simplify the inner loop. The doubling is performed using two
3320 additions (line 59) since it is usually faster than shifting, if not at least as fast.
3322 The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops
3323 get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to
A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry
that $2a + 2b + 2c = 2(a + b + c)$. That is, the sum of all of the double products is equal to double the sum of all the products. For example,
However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two
\hspace{3mm}5.8 if $ix$ is even then \\
This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm
s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
This algorithm is very similar to the Comba multiplier except with a few key differences we shall make note of.
First, we have an accumulator and carry variables $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop
addition MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal to
$a_5 \cdot a_3$; the multiplication algorithm would compute both products, but since here the sum of the products is doubled
just outside the inner loop, each pair must be computed only once. This is also a good thing since we perform
Finally the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). We add in the square
only to even outputs and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position.
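The column-oriented approach can be sketched as follows (hypothetical C on base $2^{16}$ digits, not the fast\_s\_mp\_sqr routine): each column sums its distinct cross products once, doubles that sum once per column, and adds the lone square only on even columns.

```c
#include <stdint.h>

/* Comba-style squaring sketch (illustrative, base 2^16 digits): for
 * each output column, sum the distinct pairs a[tx]*a[ty] with tx < ty,
 * double that sum once, then add the lone square a[ix/2]^2 on even
 * columns only. */
static void comba_sqr(const uint16_t *a, int n, uint16_t *c) /* c: 2n digits */
{
    uint64_t acc = 0;                       /* running accumulator */
    int ix, tx, ty;

    for (ix = 0; ix < 2 * n - 1; ix++) {
        uint64_t w = 0;
        tx = (ix < n) ? 0 : ix - n + 1;     /* first row of this column */
        ty = ix - tx;                       /* matching last row        */
        while (tx < ty) {                   /* distinct pairs only      */
            w += (uint64_t)a[tx] * a[ty];
            ++tx; --ty;
        }
        w += w;                             /* double the cross products */
        if ((ix & 1) == 0)                  /* even column: add square   */
            w += (uint64_t)a[ix / 2] * a[ix / 2];
        acc += w;
        c[ix] = (uint16_t)(acc & 0xFFFFu);
        acc >>= 16;                         /* carry to the next column  */
    }
    c[2 * n - 1] = (uint16_t)acc;
}
```

Note that the doubling happens once per column rather than once per product, which is the point of summing first.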
This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
point is fairly high. For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits.
This algorithm computes the square of an input $a$ using the Karatsuba technique. This algorithm is very similar to the Karatsuba based
The radix point for squaring is simply placed exactly in the middle of the digits when the input has an odd number of digits, otherwise it is
as the radix point. The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form.
By expanding $\left (x1 + x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $(x1 + x0)^2 - (x1^2 + x0^2) = 2 \cdot x0 \cdot x1$.
Now if $5n$ single precision additions and a squaring of $n$-digits are faster than multiplying two $n$-digit numbers and doubling then
this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality.
This results in a cutoff point around $n = 2$. As a consequence it is actually faster to compute the middle term the ``long way'' on processors
where multiplication is substantially slower\footnote{On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On
the Intel P4 processor this ratio is 1:29 making this method even more beneficial. The only common exception is the ARMv4 processor which has a
This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul. It uses the same inline style to copy and
count of both $x0$ and $x1$ is fixed up and $x0$ is clamped before the calculations begin. At this point $x1$ and $x0$ are valid equivalents
is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
it is actually below the Comba limit (\textit{at 110 digits}).
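One level of the recursion is easy to demonstrate on machine words. The following hypothetical C sketch splits a 32-bit value into 16-bit halves; it is not the mp\_int routine, only an illustration of how the middle term $2 \cdot x0 \cdot x1$ is recovered from three squarings.

```c
#include <stdint.h>

/* One level of Karatsuba squaring on a 32-bit value split into 16-bit
 * halves (illustrative sketch): with x = x1*2^16 + x0, the middle term
 * 2*x0*x1 equals (x1 + x0)^2 - x1^2 - x0^2, so three squarings replace
 * one squaring plus one general multiplication. */
static uint64_t kara_sqr32(uint32_t x)
{
    uint64_t x0 = x & 0xFFFFu;          /* low  half */
    uint64_t x1 = x >> 16;              /* high half */
    uint64_t s0 = x0 * x0;              /* x0^2 */
    uint64_t s1 = x1 * x1;              /* x1^2 */
    uint64_t mid = (x1 + x0) * (x1 + x0) - s1 - s0;  /* == 2*x0*x1 */
    return (s1 << 32) + (mid << 16) + s0;
}
```

On mp\_ints the two halves are themselves multi-digit numbers, so each of the three squarings recurses until the inputs drop below the Comba cutoff.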
redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
instead of multiplications to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
This algorithm computes the square of the input using one of four different algorithms. If the input is very large and has at least
\textbf{TOOM\_SQR\_CUTOFF} or \textbf{KARATSUBA\_SQR\_CUTOFF} digits then either the Toom-Cook or the Karatsuba Squaring algorithm is used. If
neither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used.
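The selection logic can be sketched as follows. The cutoff constants and the extra flag here are hypothetical placeholders: the real \textbf{KARATSUBA\_SQR\_CUTOFF}, \textbf{TOOM\_SQR\_CUTOFF} and \textbf{MP\_WARRAY} values are tuned per platform, and the Comba branch additionally requires accumulator headroom (the $\delta / 2$ condition).

```c
#include <string.h>

/* Hypothetical cutoffs for illustration only; real values are tuned
 * per platform and digit size.  comba_ok models the extra headroom
 * condition the Comba routine requires. */
enum { KARA_CUTOFF = 120, TOOM_CUTOFF = 400, WARRAY = 512 };

static const char *sqr_choice(int used, int comba_ok)
{
    if (used >= TOOM_CUTOFF) return "toom";        /* largest inputs */
    if (used >= KARA_CUTOFF) return "karatsuba";   /* large inputs   */
    if (used < WARRAY && comba_ok) return "comba"; /* small inputs   */
    return "baseline";                             /* catch-all      */
}
```

The ordering matters: the polynomial basis algorithms are tried from largest cutoff down, and the baseline routine remains the fallback when no faster routine qualifies.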
$\left [ 2 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\
& of double products and at most one square is stated. Prove this statement. \\
& it is effective and add the logic to mp\_mul() and mp\_sqr(). \\
Modular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms,
such as factoring. Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set. A number $a$ is said to be \textit{reduced}
modulo another number $b$ by finding the remainder of the division $a/b$. Full integer division with remainder is a topic to be covered
Modular reduction is equivalent to solving for $r$ in the equation $a = bq + r$ where $q = \lfloor a/b \rfloor$. The result
$r$ is said to be ``congruent to $a$ modulo $b$'', which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$. In other vernacular $r$ is known as the
is in modular exponentiation algorithms. That is, to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
division. Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to
Since algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper
The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed
fairly slow if not unavailable. The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into $p$-bit
fixed point representation of $5$. The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$.
of two fixed point numbers. Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal. If $2^q$ is
equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic. Using this fact dividing an integer
The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently as is the case with
Consider dividing $19$ by $5$. The correct result is $\lfloor 19/5 \rfloor = 3$. With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which
leads to a product of $19$ which when divided by $2^q$ produces $2$. However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and
the result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct. The value of $2^q$ must be close to or ideally
larger than the dividend. In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach
variable also helps reinforce the idea that it is meant to be computed once and re-used.
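The $19/5$ example from the text can be checked directly on machine words:

```c
#include <stdint.h>

/* Emulated division by reciprocal, as in the text's 19/5 example:
 * with q = 3 the reciprocal floor(2^q/5) = 1 is too coarse and yields
 * 2, while q = 4 gives floor(2^q/5) = 3 and the exact quotient 3. */
static uint32_t recip_div(uint32_t a, uint32_t b, unsigned q)
{
    uint32_t mu = (1u << q) / b;       /* fixed point reciprocal of b */
    return (a * mu) >> q;              /* multiply, then renormalize  */
}
```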
Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett
reduction the value of $a$ is bound by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough
For example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$. Consider reducing
$a = 180388626447$ modulo $b$ using the above reduction equation. The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$.
By subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found.
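The same worked example fits comfortably in 64-bit machine arithmetic, so the reduction can be demonstrated directly:

```c
#include <stdint.h>

/* Barrett reduction with the numbers from the text: b = 1179677,
 * q = 41, mu = floor(2^41 / b) = 1864089.  The estimated quotient is
 * at most slightly too small, so a short fix-up loop suffices. */
static uint64_t barrett_demo(uint64_t a, uint64_t b, uint64_t mu, unsigned q)
{
    uint64_t qhat = (a * mu) >> q;     /* emulated division by b     */
    uint64_t r = a - qhat * b;         /* candidate residue          */
    while (r >= b)                     /* at most a couple of passes */
        r -= b;
    return r;
}
```

Here $a \cdot \mu \approx 3.4 \cdot 10^{17}$ still fits in 64 bits; a real mp\_int implementation must of course handle much wider products.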
See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of
Let $a$ represent the number of which the residue is sought. Let $b$ represent the modulus used to find the residue. Let $m$ represent
the number of digits in $b$. For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if
two $m$-digit numbers have been multiplied. Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by an $m$ digit integer. Digits below the
express this is by re-writing $a$ as two parts. If $a' \equiv a \mbox{ (mod }\beta^m\mbox{)}$ and $a'' = a - a'$ then
${a \over b} \equiv {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$. Since $a'$ is bound to be less than $b$ the quotient
is bound by $0 \le {a' \over b} < 1$.
Since the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero. However, if the digits
with the irrelevant digits trimmed. Now the modular reduction is trimmed to the almost equivalent equation
Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the
by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient
$b$ once or twice the residue is found.
The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single
This is considerably faster than the original attempt.
represent the value of which the residue is desired. In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$.
With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$. The quotient is then
is found.
it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for
After the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower
multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly.
The value of $\mu$ is an $m$-digit number and $q_0$ is an $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision
After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small
multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. If $b$ is $m$ digits then the
result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are
be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted which simplifies the subtraction as well. A multiplication that produces
With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which
is considerably faster than the straightforward $3m^2$ method.
Now subtract the modulus if the residue is too large (e.g. quotient too small). \\
This algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm. It is loosely based on algorithm 14.42 of HAC
\cite[pp. 602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}. The algorithm has several restrictions and assumptions which must
First the modulus $b$ is assumed to be positive and greater than one. If the modulus were less than or equal to one then subtracting
for the quotient to have enough precision. If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem.
Technically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish. The value of $\mu$ is passed as an argument to this
algorithm and is assumed to be calculated and stored before the algorithm is used.
$s\_mp\_mul\_high\_digs$, which has not been presented, is used to accomplish this task. The algorithm is based on $s\_mp\_mul\_digs$ except that
of digits in $b$ is very much smaller than $\beta$.
While it is known that
fixed up in case it is negative. The invariant $\beta^{m+1}$ must be added to the residue to make it positive again.
The while loop at step 9 will subtract $b$ until the residue is less than $b$. If the algorithm is performed correctly this step is
the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits
in the modulus. In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is
This algorithm computes the reciprocal $\mu$ required for Barrett reduction. First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot m}$ which
is equivalent and much faster. The final value is computed by taking the integer quotient $\lfloor \beta^{2m} / b \rfloor$.
which would receive the remainder is passed as NULL. As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the
Montgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting
form of reduction in common use. It computes a modular residue which is not actually equal to the residue of the input yet instead equal to a
residue times a constant. However, as perplexing as this may sound the algorithm is relatively simple and very efficient.
$n$ must be odd. The variable $x$ will represent the quantity of which the residue is sought. Similar to the Barrett algorithm the input
is restricted to $0 \le x < n^2$. To begin the description some simple number theory facts must be established.
to explain this is that $n$ is (\textit{or multiples of $n$ are}) congruent to zero modulo $n$. Adding zero will not change the value of the residue.
\textbf{Fact 2.} If $x$ is even then performing a division by two in $\Z$ is congruent to $x \cdot 2^{-1} \mbox{ (mod }n\mbox{)}$. Actually
this is an application of the fact that if $x$ is evenly divisible by any $k \in \Z$ then division in $\Z$ will be congruent to
\hspace{3mm}1.1 If $x$ is odd then \\
The algorithm reduces the input one bit at a time using the two congruencies stated previously. Inside the loop $n$, which is odd, is
added to $x$ if $x$ is odd. This forces $x$ to be even which allows the division by two in $\Z$ to be congruent to a modular division by two. Since
$x$ is assumed to be initially much larger than $n$ the addition of $n$ will contribute an insignificant magnitude to $x$. Let $r$ represent the
final result of the Montgomery algorithm. If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to
$0 \le r < \lfloor x/2^k \rfloor + n$. As a result at most a single subtraction is required to get the residue desired.
Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 9$ (note $2^k = 512$ which is larger than $n$). The result of
the algorithm $r = 178$ is congruent to the value of $2^{-9} \cdot 5555 \mbox{ (mod }257\mbox{)}$. When $r$ is multiplied by $2^9$ modulo $257$ the correct residue
$r \equiv 158$ is produced.
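The bit-wise method is short enough to sketch on machine words; the following hypothetical C routine reproduces the figure's example exactly.

```c
#include <stdint.h>

/* Bit-wise Montgomery reduction sketch: returns x * 2^{-k} mod n for
 * odd n.  Each pass makes x even by adding the odd modulus when
 * needed, then divides by two exactly in Z. */
static uint32_t mont_bitwise(uint32_t x, uint32_t n, int k)
{
    for (int t = 0; t < k; t++) {
        if (x & 1)        /* odd: add n, which is congruent to zero */
            x += n;
        x >>= 1;          /* exact division by two */
    }
    return (x >= n) ? x - n : x;  /* at most one final subtraction */
}
```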
and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
\hspace{3mm}1.1 If the $t$'th bit of $x$ is one then \\
This algorithm is equivalent since $2^tn$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2. The number of single
precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement.
With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the
loop. Note that in the iterations $t = 2, 5, 6$ and $8$ the result $x$ is not changed; in those iterations the $t$'th bit of $x$ is
Instead of computing the reduction on a bit-by-bit basis it is actually much faster to compute it on a digit-by-digit basis. Consider the
The value $\mu n \beta^t$ is a multiple of the modulus $n$ meaning that it will not change the residue. If the first digit of
In each iteration of the loop on step 1 a new value of $\mu$ must be calculated. The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used
The value $900$ is then divided by $\beta^k$ to produce the final result $9$. The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$
which implies the result is not the modular residue of $x$ modulo $n$. However, recall that the residue is actually multiplied by $\beta^{-k}$ in
the correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$.
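The digit-wise method can also be sketched on machine words. The following hypothetical C routine uses base $2^{16}$ digits and finds $\rho = -1/n_0 \mbox{ (mod }\beta\mbox{)}$ by brute force for clarity; the modulus and input values in the test are arbitrary illustrations.

```c
#include <stdint.h>

/* Digit-wise Montgomery reduction sketch with beta = 2^16 digits
 * (illustrative only): computes x * beta^{-k} mod n for odd n with
 * k digits, 0 <= x < n^2. */
static uint64_t mont_digitwise(uint64_t x, uint64_t n, int k)
{
    uint64_t n0 = n & 0xFFFFu, rho = 0;
    /* brute-force rho = -1/n0 mod beta; a real implementation uses a
       fast Newton-style iteration instead */
    for (uint64_t i = 0; i < 0x10000u; i++)
        if (((n0 * i) & 0xFFFFu) == 0xFFFFu) { rho = i; break; }

    for (int t = 0; t < k; t++) {
        uint64_t mu = (((x >> (16 * t)) & 0xFFFFu) * rho) & 0xFFFFu;
        x += (mu * n) << (16 * t);   /* zeroes the t'th digit of x */
    }
    x >>= 16 * k;                    /* single shift at the very end */
    return (x >= n) ? x - n : x;     /* at most one subtraction      */
}
```

Each pass chooses $\mu = x_t \cdot \rho \mbox{ (mod }\beta\mbox{)}$ so that adding $\mu n \beta^t$ clears the $t$'th digit, exactly as in the description above.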
The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for
This algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm. The algorithm is loosely based
for the Barrett algorithm. Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$. $\rho$ must be calculated in
advance of this algorithm. Finally the variable $k$ is fixed and a pseudonym for $n.used$.
Step 2 decides whether a faster Montgomery algorithm can be used. It is based on the Comba technique meaning that there are limits on
the size of the input. This algorithm is discussed in sub-section 6.3.3.
Step 5 is the main reduction loop of the algorithm. The value of $\mu$ is calculated once per iteration in the outer loop. The inner loop
This is the baseline implementation of the Montgomery reduction algorithm. Lines 31 to 36 determine if the Comba based
The multiplication $\mu n \beta^{ix}$ is performed in one step in the inner loop. The alias $tmpx$ refers to the $ix$'th digit of $x$ and
The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction, however it is much slower due to the serial
The biggest obstacle is that at the $ix$'th iteration of the outer loop the value of $x_{ix}$ is required to calculate $\mu$. This means the
carries from $0$ to $ix - 1$ must have been propagated upwards to form a valid $ix$'th digit. The solution as it turns out is very simple.
This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly
As in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product. It is initially filled with the
contents of $x$ with the excess digits zeroed. The reduction loop is very similar to the baseline loop at heart. The multiplication on step
a single precision multiplication instead half the amount of time is spent.
Also note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work. That is what step
how the upper bits of those same words are not reduced modulo $\beta$. This is because those values will be discarded shortly and there is no
The $\hat W$ array is first filled with digits of $x$ on line 48 then the rest of the digits are zeroed on line 55. Both loops share
The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to an mp\_digit. This
digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$.
2. If $b$ is even return(\textit{MP\_VAL}) \\
to calculate $1/n_0$ when $\beta$ is a power of two.
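When $\beta$ is a power of two the inverse can be found with a handful of multiplications via a Newton-style iteration, each step of which doubles the number of correct low-order bits. The following sketch works modulo $2^{32}$; the seed expression and step count are the widely used forms of this trick, shown here as an illustration rather than the library's exact code.

```c
#include <stdint.h>

/* Newton iteration for rho = -1/n0 mod 2^32, n0 odd.  The seed is
 * correct mod 2^4; every multiply-step doubles the number of correct
 * low-order bits of the inverse. */
static uint32_t mont_rho(uint32_t n0)
{
    uint32_t x = (((n0 + 2) & 4) << 1) + n0;  /* inverse mod 2^4  */
    x *= 2 - n0 * x;                          /* now mod 2^8      */
    x *= 2 - n0 * x;                          /* now mod 2^16     */
    x *= 2 - n0 * x;                          /* now mod 2^32     */
    return (uint32_t)0 - x;                   /* negate: -1/n0    */
}
```

The iteration relies on the identity that if $x \equiv n_0^{-1} \mbox{ (mod }2^k\mbox{)}$ then $x(2 - n_0 x) \equiv n_0^{-1} \mbox{ (mod }2^{2k}\mbox{)}$.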
multiplications when $\beta$ is not the default 28-bits.
The Diminished Radix method of modular reduction \cite{DRMET} is a fairly clever technique which can be more efficient than either the Barrett
or Montgomery methods for certain forms of moduli. The technique is based on the following simple congruence.
of the above equation is very simple. First write $x$ in the product form.
into the equation the original congruence is reproduced, thus concluding the proof. The following algorithm is based on this observation.
once or twice and occasionally three times. For simplicity's sake the value of $x$ is bounded by the following simple polynomial.
The true bound is $0 \le x < (n - k - 1)^2$ but this has quite a few more terms. The value of $q$ after step 1 is bounded by the following.
Since $k^2$ is going to be considerably smaller than $n$ that term will always be zero. The value of $x$ after step 3 is bounded trivially as
$0 \le x < n$. By step four the sum $x + q$ is bounded by
With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3. After the second pass it is highly unlikely that the
is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only
modular reductions. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen.
Division in general is a very expensive operation to perform. The one exception is when the division is by a power of the radix of representation used.
Division by ten for example is simple for pencil and paper mathematics since it amounts to shifting the decimal place to the right. Similarly division
by two (\textit{or powers of two}) is very simple for binary computers to perform. It would therefore seem logical to choose $n$ of the form $2^p$
which would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits.
However, there is one operation related to division by powers of two that is even faster than this. If $n = \beta^p$ then the division may be
performed by moving whole digits to the right $p$ places. In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$.
modulus'' will refer to a modulus of the form $2^p - k$. The word ``restricted'' in this case refers to the fact that it is based on the
in step 2 is the most expensive operation. Fortunately the choice of $k$ is not terribly limited. For all intents and purposes it might
as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be.
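A machine-word sketch makes the folding step concrete. The following hypothetical C routine reduces modulo $n = \beta^m - k$ with $\beta = 2^{16}$, $m = 1$ and $k = 15$ (so $n = 65521$, an arbitrary choice for illustration): since $\beta^m \equiv k \mbox{ (mod }n\mbox{)}$, the upper digits are simply multiplied by $k$ and folded back down.

```c
#include <stdint.h>

/* Restricted Diminished Radix sketch: reduce x modulo n = beta^m - k
 * with beta = 2^16, m = 1, k = 15, so n = 65521.  Writing
 * x = q*beta^m + r gives x ≡ q*k + r (mod n). */
static uint32_t dr_reduce(uint64_t x)
{
    const uint32_t n = 65521u, k = 15u;
    while (x >> 16)                        /* fold: beta ≡ k (mod n)  */
        x = (x & 0xFFFFu) + k * (x >> 16);
    if (x >= n)                            /* at most one subtraction */
        x -= n;
    return (uint32_t)x;
}
```

No division ever occurs; the "division by $\beta^m$" is just a digit shift and the multiplication is by the single digit $k$.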
of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular
and addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4. The division by $\beta^m$ is emulated by accessing
the term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position. After the loop the $m$'th
digit is set to the carry and the upper digits are zeroed. Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to
At step 8 if $x$ is still larger than $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes
The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$. The label on line 52 is where
the modulus and the question of whether $x$ is large enough are invariant after the first pass, meaning that it would be a waste of time.
The aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits. By reading digits from $x$ offset by $m$ digits
By line 67 the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed. Similarly by line 74 the
Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required.
With the same logic at line 81 the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used
as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code
To set up the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not really complicated but provided for
Another algorithm which will be useful is the ability to detect a restricted Diminished Radix modulus. An integer is said to be
\textbf{Output}. $1$ if $n$ is in D.R form, $0$ otherwise \\
This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are
The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$. This algorithm
is a straightforward adaptation of algorithm~\ref{fig:DR}.
In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead. However, this new
algorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form.
\hspace{11.5mm}($a \ge 0$, $n > 1$, $0 < k < \beta$, $n + k$ is a power of two) \\
This algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$. Division by $2^p$ is emulated with a right
The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d
is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without
positive. By using the unsigned versions the overhead is kept to a minimum.
To set up this reduction algorithm the value of $k = 2^p - n$ is required.
is sufficient to solve for $k$. Alternatively if $n$ has more than one digit the value of $k$ is simply $\beta - n_0$.
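The unrestricted variant folds on a bit boundary rather than a digit boundary. The following hypothetical C sketch uses $n = 2^{13} - 3 = 8189$ (an arbitrary illustrative modulus where $p$ is not a multiple of the digit size): since $2^p \equiv k \mbox{ (mod }n\mbox{)}$, a shift and a small multiplication replace the division.

```c
#include <stdint.h>

/* Unrestricted Diminished Radix sketch: n = 2^p - k with p = 13 and
 * k = 3, so n = 8189.  Each fold computes
 * x = (x mod 2^p) + k * floor(x / 2^p), congruent to x modulo n. */
static uint32_t reduce_2k(uint64_t x)
{
    const uint32_t n = 8189u, k = 3u, p = 13u;
    while (x >> p)                           /* fold: 2^p ≡ k (mod n)  */
        x = (x & ((1u << p) - 1)) + k * (x >> p);
    if (x >= n)                              /* x < 2^p < 2n here      */
        x -= n;
    return (uint32_t)x;
}
```

With $k = 1$ (moduli of the form $2^p - 1$) the multiplication disappears entirely, which is the case the text notes is handled without it.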
An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following are true.
\item The number has more than one digit and every bit from the $\beta$'th to the most significant is one.
If either condition is true then there is a power of two $2^p$ such that $0 < 2^p - n < \beta$. If the input is only
\hspace{3mm}4.1 If the ($x \mbox{ mod }lg(\beta)$)'th bit of the $\lfloor x / lg(\beta) \rfloor$'th digit of $n$ is zero then return($0$). \\
This algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly.
reduction can be written as a single function with the Comba technique it is much faster. Barrett reduction suffers from the overhead of
For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly
Exponentiation is the operation of raising one variable to the power of another, for example, $a^b$. A variant of exponentiation, computed
in a finite field or ring, is called modular exponentiation. This latter style of operation is typically used in public key
cryptosystems such as RSA and Diffie-Hellman. The ability to quickly compute modular exponentiations is of great benefit to any
the number of multiplications becomes prohibitive. Imagine what would happen if $b$ $\approx$ $2^{1024}$ as is the case when computing an RSA signature
Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which
significant bit. If $b$ is a $k$-bit integer then the following equation is true.
By taking the base $a$ logarithm of both sides of the equation the following equation is the result.
The term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to
$k \over 2$ multiplications to compute the result. This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times.
While this current method is a considerable speed up there are further improvements to be made. For example, the $a^{2^i}$ term does not need to
This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is
multiplied against the current product. In each iteration the product is squared which doubles the exponent of the individual terms of the
When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation. This particular algorithm is
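The left-to-right square-and-multiply pattern is compact enough to sketch on machine words (a hypothetical C illustration, not the mp\_int routine): the result is invariably squared, and multiplied by the base only when the current exponent bit is one.

```c
#include <stdint.h>

/* Left-to-right square-and-multiply sketch: scan the exponent from the
 * most significant bit down; square always, multiply by the base only
 * when the current bit is set.  No overflow handling: a^b must fit in
 * 64 bits. */
static uint64_t expt(uint64_t a, uint32_t b)
{
    uint64_t c = 1;
    for (int i = 31; i >= 0; i--) {
        c *= c;                        /* invariant squaring        */
        if ((b >> i) & 1)              /* bit set: multiply by base */
            c *= a;
    }
    return c;
}
```

For $b = 44 = 101100_2$ the multiplications occur at the bits producing $a^{32} \cdot a^8 \cdot a^4 = a^{44}$, matching the example above.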
The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended
to be used when a small power of an input is required (\textit{e.g. $a^5$}). It is faster than simply multiplying $b - 1$ times for all values of
quickly compute the exponentiation. It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the
exponent is a fixed width.
A copy of $a$ is made first to allow the destination variable $c$ to be the same as the source variable $a$. The result is set to the initial value of
Inside the loop the exponent is read from the most significant bit first down to the least significant bit. First $c$ is invariably squared
on step 3.1. In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$. The value
of $b$ is shifted left one bit to make the next bit down from the most significant bit the new most significant bit. In effect each
the most significant down towards the least significant. The invariant squaring operation placed on line 33 is performed first. After
the squaring the result $c$ is multiplied by the base $g$ if and only if the most significant bit of the exponent is set. The shift on line
When calculating an exponentiation the most time consuming bottleneck is the multiplications which are in general a small factor
computes the same exponentiation. A group of $k$ bits from the exponent is called a \textit{window}. That is, it is a small window on only a
$2^{k - 1} + 1$ multiplications. This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$.
However, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}.
approach is to brute force search amongst the values $k = 2, 3, \ldots, 8$ for the lowest result. Table~\ref{fig:OPTK} lists optimal values of $k$
A simple modification to the previous algorithm is to only generate the upper half of the table in the range $2^{k-1} \le g < 2^k$. Essentially
this is a table for all values of $g$ where the most significant bit of $g$ is a one. However, in order for this to be allowed in the
\hspace{3mm}2.1 If the $i$'th bit of $b$ is a zero then \\
algorithm requires the same number of squarings it can potentially have fewer multiplications. The pre-computed table $a^g$ is also half
In general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster.
Modular exponentiation is essentially computing the power of a base within a finite field or ring. For example, computing
$d \equiv a^b \mbox{ (mod }c\mbox{)}$ is a modular exponentiation. Instead of first computing $a^b$ and then reducing it
modulo $c$ the intermediate result is reduced modulo $c$ after every squaring or multiplication operation.
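Interleaving the reduction with the square-and-multiply loop can be sketched on machine words (hypothetical C, valid only while $c < 2^{32}$ so the intermediate products fit in 64 bits):

```c
#include <stdint.h>

/* Modular square-and-multiply sketch: the intermediate result is
 * reduced after every squaring and multiplication, so it never grows
 * beyond (c - 1)^2 before reduction.  Requires c < 2^32 so that the
 * 64-bit products cannot overflow. */
static uint64_t exptmod(uint64_t a, uint32_t b, uint64_t c)
{
    uint64_t d = 1;
    a %= c;                            /* keep the base reduced too */
    for (int i = 31; i >= 0; i--) {
        d = (d * d) % c;               /* square, then reduce       */
        if ((b >> i) & 1)
            d = (d * a) % c;           /* multiply, then reduce     */
    }
    return d;
}
```

In the mp\_int routines the `% c` step is replaced by one of the reduction algorithms of the previous chapter, chosen according to the form of the modulus.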
4974 This guarantees that any intermediate result is bounded by $0 \le d \le c^2 - 2c + 1$ and can be reduced modulo $c$ quickly using
will allow the exponent $b$ to be negative which is computed as $c \equiv \left (1 / a \right )^{\vert b \vert} \mbox{(mod }d\mbox{)}$. The
value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}). If no inverse exists the algorithm
3. if $p$ is odd \textbf{OR} $p$ is a D.R. modulus then \\
The first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod. It is a sliding window $k$-ary algorithm
In order to keep the algorithms in a known state the first step on line 29 is to reject any negative modulus as input. If the exponent is
negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned
the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recurse with these new values with a positive
If the exponent is positive the algorithm resumes the exponentiation. Line 77 determines if the modulus is of the restricted Diminished Radix
form. If it is not, line 70 attempts to determine if it is of an unrestricted Diminished Radix form. The integer $dr$ will take on one
\item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form.
\item $dr = 1$ means that the modulus is of restricted Diminished Radix form.
\item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form.
Line 69 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise,
the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction.
\hspace{6mm}Window is full so perform the squarings and single multiplication. \\
larger the window size becomes. After a window size $winsize$ has been chosen an array of $2^{winsize}$ mp\_int variables is allocated. This
After the table is allocated the first power of $g$ is found. Since $g \ge p$ is allowed it must be first reduced modulo $p$ to make
the rest of the algorithm more efficient. The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 2$
Now that the table is available the sliding window may begin. The following list describes the functions of all the variables in the window.
$1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit. In this case bits are ignored until a non-zero bit is found.
are read and a single squaring is performed. If a non-zero bit is read a new window is created.
\item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit
is fetched from the exponent.
\item The variable $digidx$ is an index into the exponent's digits. It starts at the leading digit $x.used - 1$ and moves towards the trailing digit.
\item The variable $bitcpy$ indicates how many bits are in the currently formed window. When it reaches $winsize$ the window is flushed and
All of step 12 is the window processing loop. It will iterate while there are digits available from the exponent to read. The first step
inside this loop is to extract a new digit if no more bits are available in the current digit. If there are no bits left a new digit is
After a digit is made available step 12.3 will extract the most significant bit of the current digit and move all other bits in the digit
upwards. In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to
trailing edges the entire exponent is read from most significant bit to least significant bit.
At step 12.5 if the $mode$ and currently extracted bit $y$ are both zero the bit is ignored and the next bit is read. This prevents the
algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read. Steps 12.6 and 12.7 through 12.10 handle
a Left-to-Right algorithm is used to process the remaining few bits.
on line 38 the value of $x$ is already known to be greater than $140$.
The conditional piece of code beginning on line 48 allows the window size to be restricted to five bits. This logic is used to ensure
Calculating $b = 2^a$ can be performed much quicker than with any of the previous algorithms. Recall that a logical shift left $m << k$ is
The first section describes a method of integer division with remainder that is universally well known. It provides the signed division logic
Integer division, aside from modular exponentiation, is the most computationally intensive algorithm. Like addition, subtraction and multiplication
the basis of this algorithm is the long-hand division algorithm taught to school children. Throughout this discussion several common variables
\hspace{3mm}3.1 Maximize $k$ such that $kx\beta^t$ is less than or equal to $y$ and $(k + 1)x\beta^t$ is greater. \\
To find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and
simultaneously $(k + 1)x\beta^t$ is greater than $y$. Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have. The habitual method
used to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits, and quickly estimate a quotient. By only using leading
arises as a possible solution. Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$.
As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$.
Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder
$237 \cdot 23 + 20 = 5471$ is true.
speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ and $\vert \vert x \vert \vert - p$ digits of the
The value of the estimation may be off by a few values in either direction and in general is fairly accurate. A simplification \cite[pp. 271]{TAOCPV2}
of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, particularly when $t = 1$. The estimate
using this technique is never too small. For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$
\textbf{Proof.}\textit{ The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to
The first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger. For all other
This is trivially true since $x \ge x_s\beta^s$. Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$.
By simplifying the previous inequality the following inequality is formed.
For the purposes of division a normalized input is when the divisor's leading digit $x_n$ is greater than or equal to $\beta / 2$. By multiplying both
$x$ and $y$ by $j = \lfloor (\beta / 2) / x_n \rfloor$ the quotient remains unchanged and the remainder is simply $j$ times the original
remainder. The purpose of normalization is to ensure the leading digit of the divisor is sufficiently large such that the estimated quotient will
Normalize the inputs such that the leading digit of $y$ is greater than or equal to $\beta / 2$. \\
This algorithm will calculate quotient and remainder from an integer division given a dividend and divisor. The algorithm is a signed
First the divisor $b$ must be non-zero which is enforced in step one. If the divisor is larger than the dividend then the quotient is implicitly
zero and the remainder is the dividend.
After the first two trivial cases of inputs are handled the variable $q$ is set up to receive the digits of the quotient. Two unsigned copies of the
divisor $y$ and dividend $x$ are made as well. The core of the division algorithm is an unsigned division and will only work if the values are
positive. Now the two values $x$ and $y$ must be normalized such that the leading digit of $y$ is greater than or equal to $\beta / 2$.
This is performed by shifting both to the left by enough bits to get the desired normalization.
At this point the division algorithm can begin producing digits of the quotient. Recall that the maximum value of the estimation used is
$2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means. In this case $y$ is shifted
shifted copy of $y$ until $x$ is smaller. Since the leading digit of $y$ is greater than or equal to $\beta/2$ this loop will iterate at most two
Now the remainder of the digits can be produced. The equation $\hat q = \lfloor {{x_i \beta + x_{i-1}}\over y_t} \rfloor$ is used to fairly
induction the upper quotient digit is correct (\textit{as established on step eleven}) and the estimate must be less than $\beta$.
Recall from section~\ref{sec:divest} that the estimation is never too low but may be too high. The next step of the estimation process is
After both phases of estimation the quotient digit may still be off by a value of one\footnote{This is similar to the error introduced
Now that the quotient has been determined, finalizing the result is a matter of clamping the quotient, fixing the sizes and de-normalizing the
is that when the estimations are being made (\textit{inside the loop on step 13.5}) the digits $y_{t-1}$, $x_{i-2}$ and $x_{i-1}$ may lie
remainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired. For example, the C code to call the division
algorithm with only the quotient is
The number of bits in the leading digit is calculated on line 151. Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits
of precision which when reduced modulo $lg(\beta)$ produces the value of $k$. In this case $k$ is the number of bits in the leading digit which is
exactly what is required. For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting
The conditional ``continue'' on line 187 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
algorithm eliminates multiple non-zero digits in a single iteration. This ensures that $x_i$ is always non-zero since by definition the digits
above the $i$'th position of $x$ must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}.
the helper functions assume the single digit input is positive and will treat them as such.
The single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int.
multiplication algorithm. Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands
This algorithm quickly multiplies an mp\_int by a small single digit value. It is specially tailored to the job and has minimal overhead.
In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
Like the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion. Since the
divisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient.
algorithm another digit of the dividend is reduced and another digit of quotient produced. Provided $b < \beta$ the value of $\hat w$
If the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3. It replaces the division by three with
a multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup. In essence it is much like the Barrett reduction
indicate the respective value is not required. This allows a trivial single digit modular reduction algorithm, mp\_mod\_d, to be created.
Finding the $n$'th root of an integer is fairly easy as far as numerical analysis is concerned. Algorithms such as the Newton-Raphson approximation
In this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired. The derivative of $f(x)$ is
simply $f'(x) = nx^{n - 1}$. Of particular importance is that this algorithm will be used over the integers, not over a more continuous domain
algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$.
1. If $b$ is even and $a.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\
This algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach. It is partially optimized based on the observation
that the numerator of ${f(x) \over f'(x)}$ can be derived from a partial denominator. That is, at first the denominator is calculated by finding
The initial value of the approximation is $t2 = 2$ which allows the algorithm to start with very small values and quickly converge on the
root. Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$.
is solely for simulations and not intended for cryptographic use.
This algorithm produces a pseudo-random integer of $b$ digits. By ensuring that the first digit is non-zero the algorithm also guarantees that the
The ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties. For example, the ability to
For the purposes of this text we will assume that a simple lower ASCII map (\ref{fig:ASC}) is used to map the values from $0$ to $63$ to
printable characters. For example, when the character ``N'' is read it represents the integer $23$. The first $16$ characters of the
\hspace{3mm}6.2 If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\
string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input
Generating radix-$n$ output is fairly trivial with a division and remainder algorithm.
each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration. As a result a series of trivial $n \times 1$ divisions
are required instead of a series of $n \times k$ divisions. One design flaw of this approach is that the digits are produced in the reverse order
The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that divides
both $a$ and $b$. That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ occur
The most common approach (cite) is to reduce one input modulo another. That is, if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then
$r$ is also divisible by $k$. The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$.
relatively expensive operations to perform and should ideally be avoided. There is another approach based on a similar relationship of
greatest common divisors. The faster approach is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $a - b$.
\hspace{3mm}1.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
As a matter of practicality algorithm \ref{fig:gcd1} decreases far too slowly to be useful. Especially if $b$ is much larger than $a$ such that
$b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
not divide the greatest common divisor but will divide $b - a$. In this case ${b - a} \over p$ is also an integer and still divisible by
Then inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed.
3. While $a$ is divisible by $p$ do \\
4. While $b$ is divisible by $p$ do \\
\hspace{3mm}5.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
\hspace{3mm}5.3 While $b$ is divisible by $p$ do \\
This algorithm is based on the first except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
decreases more rapidly. The first loop on step two removes powers of $p$ that are in common. A count, $k$, is kept which will represent a common
to compute. The ideal choice of $p$ is two since division by two amounts to a right logical shift. Another important observation is that by
The first two steps handle the cases where either one of or both inputs are zero. If either input is zero the greatest common divisor is the
Step five will divide out any common factors of two and keep track of the count in the variable $k$. After this step, two is no longer a
By step eight both of $u$ and $v$ are odd which is required for the inner logic. First the pair are swapped such that $v$ is equal to
This function makes use of the macros mp\_iszero and mp\_iseven. The former evaluates to $1$ if the input mp\_int is equivalent to the
it evaluates to $0$. Note that just because mp\_iseven may evaluate to $0$ does not mean the input is odd; it could also be zero. The three
zero bits in both. The local integer $k$ is used to keep track of how many factors of $2$ are pulled out of both values. It is assumed that
entries than are accessible by an ``int'' so this is not a limitation.}.
on line 73 performs the reduction of the pair until $v$ is equal to zero. The unsigned comparison and subtraction algorithms are used in
place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
The least common multiple of a pair of integers is their product divided by their greatest common divisor. For two integers $a$ and $b$ the
least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$. For example, if $a = 2 \cdot 2 \cdot 3 = 12$
and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {{1512} \over 6} = 252$.
collide, that is, be in synchronous states, after only $[ a, b ]$ iterations. This is why, for example, random number generators based on
Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e., the greatest common divisor is one}).
To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Arrg. What is the name of this?} from which the Jacobi symbol is
defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
-1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
1 & \mbox{if }a\mbox{ is a quadratic residue}.
An integer $a$ is a quadratic residue if the following equation has a solution.
Whether equation \ref{eqn:root} has a solution or not, equation \ref{eqn:rooti} is always true. If $a^{(p-1)/2} - 1 \equiv 0 \mbox{ (mod }p\mbox{)}$
is not a quadratic residue then the only other value $a^{(p-1)/2}$ may be congruent to is $-1$ since
The Jacobi symbol is a generalization of the Legendre function for any odd non-prime moduli $p$ greater than 2. If $p = \prod_{i=0}^n p_i$ then
the Jacobi symbol $\left ( { a \over p } \right )$ is equal to the following equation.
By inspection if $p$ is prime the Jacobi symbol is equivalent to the Legendre function. The following facts\footnote{See HAC \cite[pp. 72-74]{HAC} for
further details.} will be used to derive an efficient Jacobi symbol algorithm. Where $p$ is an odd integer greater than two and $a, b \in \Z$ the
By putting both observations into equation \ref{eqn:jacobi} the following simplified equation is formed.
$\left ( {2 \over p } \right )^k$ equals $1$ if $k$ is even; otherwise it equals $\left ( {2 \over p } \right )$. Using this approach the
Jacobi symbol computation of $\left ( {1 \over a'} \right )$ which is simply $1$.
is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.
input $a$. If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one. If $k$ is odd then the term evaluates to one
if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled
the $(-1)^{(p-1)(a'-1)/4}$ term is computed and multiplied against the current product $s$. The latter term evaluates to one if both $p$ and $a'$
By step nine if $a'$ does not equal one a recursion is required. Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute
$\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product.
As a matter of practicality the variable $a'$ as per the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not valid for a C
The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm
has to proceed to compute the Jacobi symbol. The variable $s$ is used to hold the current Jacobi product. Note that $s$ is merely a C ``int'' data type since
After a local copy of $a$ is made all of the factors of two are divided out and the total stored in $k$. Technically only the least significant
bit of $k$ is required; however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same
processor requirements and neither is faster than the other.
Lines 58 through 71 determine the value of $\left ( { 2 \over p } \right )^k$. If the least significant bit of $k$ is zero then
$k$ is even and the value is one. Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight. The value of
$(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines 71 through 74.
exist another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$. The integer $b$ is called the multiplicative inverse of $a$ which is
denoted as $b = a^{-1}$. Technically speaking modular inversion is a well-defined operation for any finite ring or field not just for rings and
The simplest approach is to compute the algebraic inverse of the input. That is, to compute $b \equiv a^{\Phi(p) - 1}$. If $\Phi(p)$ is the
order of the multiplicative subgroup modulo $p$ then $b$ must be the multiplicative inverse of $a$. The proof of this is trivial.
However, as simple as this approach may be it has two serious flaws. It requires that the value of $\Phi(p)$ be known, which, if $p$ is composite,
requires knowledge of all the prime factors. This approach is also very slow as the size of $p$ grows.
A simpler approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear
Where $a$, $b$, $p$ and $q$ are all integers. If such a pair of integers $ \left < b, q \right >$ exists then $b$ is the multiplicative inverse of
binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
This algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$. This algorithm is a variation of the
If $b \le 0$ then the modulus is invalid and MP\_VAL is returned. Similarly if both $a$ and $b$ are even then there cannot be a multiplicative
inverse for $a$ and the error is reported.
the other variables to the Diophantine equation are solved. The algorithm terminates when $u = 0$ in which case the solution is
If $v$, the greatest common divisor of $a$ and $b$, is not equal to one then the algorithm will report an error as no inverse exists. Otherwise, $C$
is the modular inverse of $a$. The actual value of $C$ is congruent to, but not necessarily equal to, the ideal modular inverse which should lie
within $1 \le a^{-1} < b$. Steps twelve and thirteen adjust the inverse until it is in range. If the original input $a$ is within $0 < a < p$
When the modulus $b$ is odd the variables $A$ and $C$ are fixed and are not required to compute the inverse. In particular by attempting to solve
The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed. This
A non-zero integer $a$ is said to be prime if it is not divisible by any other integer excluding one and itself. For example, $a = 7$ is prime
since the integers $2 \ldots 6$ do not evenly divide $a$. By contrast, $a = 6$ is not prime since $a = 6 = 2 \cdot 3$.
Prime numbers arise in cryptography considerably as they allow finite fields to be formed. The ability to determine whether an integer is prime or
probabilistic algorithms in that when they report an integer is composite it must be composite. However, when the algorithms report an integer is
As will be discussed it is possible to limit the probability of error so well that for practical purposes the probability of error might as
well be zero. For the purposes of these discussions let $n$ represent the candidate integer of which the primality is in question.
cannot be prime. By dividing by all primes $1 < p \le \sqrt{n}$ this test can actually prove whether an integer is prime. However, such a test
of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove if a candidate is prime. However, often it can prove a candidate is not prime.
The benefit of this test is that trial division by small values is fairly efficient, especially compared to the other algorithms that will be
discussed shortly. The probability that this approach correctly identifies a composite candidate when tested with all primes up to $q$ is given by
At approximately $q = 30$ the gain of performing further tests diminishes fairly quickly. At $q = 90$ further testing is generally not going to
be of any practical use. In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate
approximately $80\%$ of all candidate integers. The constant \textbf{PRIME\_SIZE} is equal to the number of primes in the test base. The
array \_\_prime\_tab is an array of the first \textbf{PRIME\_SIZE} prime numbers.
\textbf{Output}. $c = 1$ if $n$ is divisible by a small prime, otherwise $c = 0$. \\
This algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions.
mp\_digit. The table \_\_prime\_tab is defined in the following file.
Note that there are two possible tables. When an mp\_digit is 7 bits long only the primes up to $127$ may be included, otherwise the primes
up to $1619$ are used. Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of an mp\_digit.
The Fermat test is probably one of the oldest tests to have a non-trivial probability of success. It is based on the fact that if $n$ is in
fact prime then $a^{n} \equiv a \mbox{ (mod }n\mbox{)}$ for all $0 < a < n$. The reason is that if $n$ is prime then the order of
the multiplicative subgroup is $n - 1$. Any base $a$ must have an order which divides $n - 1$ and as such $a^n$ is equivalent to
If $n$ is composite then any given base $a$ does not have to have a period which divides $n - 1$. In which case
it is possible that $a^n \nequiv a \mbox{ (mod }n\mbox{)}$. However, this test is not absolute as it is possible that the order
of a base will divide $n - 1$ which would then be reported as prime. Such a base yields what is known as a Fermat pseudo-prime. Several
This algorithm determines whether an mp\_int $a$ is a Fermat prime to the base $b$ or not. It uses a single modular exponentiation to
The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test, specifically with sequentially chosen
candidate integers. The algorithm is based on the observation that if $n - 1 = 2^kr$ and if $b^r \nequiv \pm 1$ then after up to $k - 1$ squarings the
value must be equal to $-1$. The squarings are stopped as soon as $-1$ is observed. If the value of $1$ is observed first it means that
some value not congruent to $\pm 1$ when squared equals one which cannot occur if $n$ is prime.
\textbf{Output}. $c = 1$ if $a$ is a Miller-Rabin prime to the base $b$, otherwise $c = 0$. \\
if $a$ is composite or $c = 0$ if $a$ is provably composite. The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$.
If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will
is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably
composite then it is \textit{probably} prime.