$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $
$FreeBSD$

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
formats are supported:  single precision, double precision, extended double
precision, and quadruple precision.  All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat.  It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard.  Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic.  If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions.  When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information


-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704.  Funding was partially
provided by the National Science Foundation under grant MIP-9311980.  The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).  The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used.  The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
ISO/ANSI C guarantees at least one built-in integer type of at least 32
bits, the `float32' type is identified with an appropriate integer type.  The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format.  (The floating-
   point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.

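As an illustration of the memory-layout remark earlier in this section, the
following C sketch copies a native `float' into a 32-bit integer so that its
bit pattern can be inspected.  It assumes the native `float' is a 32-bit
IEC/IEEE single-precision value.  The names `demo_float32' and
`demo_float32_from_native' are hypothetical and are not part of
`softfloat.h'.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32': a plain 32-bit integer
   holding the bit pattern of a single-precision value. */
typedef uint32_t demo_float32;

/* Copy the bytes of a native float into the integer-based type.  memcpy is
   used instead of a pointer cast to stay clear of aliasing problems.
   Assumes sizeof(float) == sizeof(uint32_t). */
static demo_float32 demo_float32_from_native(float f)
{
    demo_float32 bits;

    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under the stated assumption, `demo_float32_from_native(1.0f)' yields the
pattern 0x3F800000.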

-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.

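The effect of the four modes, and the usual save/set/restore pattern for a
global mode variable, can be sketched in plain C without SoftFloat itself.
All names below (`demo_rounding_mode', `demo_round_to_int',
`demo_round_with', and the `demo_round_*' constants) are hypothetical
stand-ins, and the constant values are arbitrary.  The sketch only handles
values of small magnitude and does not model the sign of a zero result.

```c
/* Hypothetical stand-ins for the mode names in `softfloat.h'; the real
   constants may have different values. */
enum {
    demo_round_nearest_even,
    demo_round_to_zero,
    demo_round_down,
    demo_round_up
};

/* Stand-in for the global `float_rounding_mode'. */
static int demo_rounding_mode = demo_round_nearest_even;

/* Round x to an integral value under the current mode.  Only meant for
   magnitudes that fit easily in a long long. */
static double demo_round_to_int(double x)
{
    long long fl = (long long)x;            /* truncates toward zero */
    double frac;

    if ((double)fl > x) fl--;               /* make it a true floor */
    frac = x - (double)fl;                  /* fractional part in [0, 1) */
    if (frac == 0.0) return x;              /* already an integer */

    switch (demo_rounding_mode) {
    case demo_round_down:
        return (double)fl;
    case demo_round_up:
        return (double)(fl + 1);
    case demo_round_to_zero:
        return (x < 0.0) ? (double)(fl + 1) : (double)fl;
    default:                                /* nearest, ties to even */
        if (frac > 0.5 || (frac == 0.5 && (fl & 1))) fl++;
        return (double)fl;
    }
}

/* Select a mode for one call, then restore the caller's mode. */
static double demo_round_with(int mode, double x)
{
    int saved = demo_rounding_mode;
    double r;

    demo_rounding_mode = mode;
    r = demo_round_to_int(x);
    demo_rounding_mode = saved;
    return r;
}
```

In this sketch, 2.5 becomes 2.0 under nearest/even, round-to-zero, and
round-down, and 3.0 under round-up.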

-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'.  The operations affected are:

   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format.  Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively.  When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified.  Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.

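What ``rounded to reduced precision, with the remaining bits set to zero''
means for a 64-bit extended significand can be sketched as follows.  This is
an illustration only and is not SoftFloat's internal code.  It assumes the
nearest/even rounding mode, and it assumes that single and double precision
correspond to keeping 24 and 53 significand bits respectively.  The names
`demo_rounded' and `demo_round_sig' are hypothetical.

```c
#include <stdint.h>

/* Result of rounding a 64-bit significand: the new significand and the
   amount (0 or 1) by which the exponent must be increased. */
typedef struct {
    uint64_t sig;
    int      exp_adjust;
} demo_rounded;

/* Keep the top keep_bits bits of sig (1..64), rounding to nearest with
   ties to even, and clear every bit below the rounding point. */
static demo_rounded demo_round_sig(uint64_t sig, int keep_bits)
{
    demo_rounded r = { sig, 0 };
    int drop = 64 - keep_bits;
    uint64_t unit, mask, half, rem, kept;

    if (drop <= 0) return r;                /* full precision: no change */

    unit = (uint64_t)1 << drop;             /* weight of last kept bit */
    mask = unit - 1;                        /* bits to be discarded */
    half = unit >> 1;                       /* exactly halfway */
    rem  = sig & mask;
    kept = sig & ~mask;

    if (rem > half || (rem == half && (kept & unit))) {
        kept += unit;                       /* round up */
        if (kept == 0) {                    /* carried out of bit 63 */
            kept = (uint64_t)1 << 63;
            r.exp_adjust = 1;
        }
    }
    r.sig = kept;
    return r;
}
```

For example, `demo_round_sig(0x8000000000000001, 24)' returns a significand
of 0x8000000000000000, with all 40 low-order bits cleared.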

-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented.  Each flag is stored as a unique bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.

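The flag-handling pattern described in this section (clear a flag, perform
an operation, then test the flag) can be sketched with local stand-ins.  The
variable `demo_exception_flags', the `demo_flag_*' masks, and the helper
functions are all hypothetical; the real variable and masks come from
`softfloat.h', and the bit values chosen here are arbitrary.

```c
/* Hypothetical stand-ins for the masks in `softfloat.h'.  The actual bit
   assignments are not specified by this document. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_divbyzero = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for `float_exception_flags'; all 0 means no exceptions. */
static int demo_exception_flags = 0;

/* Stand-in for `float_raise': set the requested flags (no trap here). */
static void demo_raise(int mask) { demo_exception_flags |= mask; }

/* Clear the flags in mask, leaving the others untouched. */
static void demo_clear(int mask) { demo_exception_flags &= ~mask; }

/* Return 1 if any flag in mask is currently set, else 0. */
static int demo_test(int mask) { return (demo_exception_flags & mask) != 0; }

/* Typical pattern: clear, run an operation, then check what it raised.
   The "operation" here is simulated by raising two flags directly. */
static int demo_operation_was_inexact(void)
{
    demo_clear(demo_flag_inexact);
    demo_raise(demo_flag_inexact | demo_flag_underflow);
    return demo_test(demo_flag_inexact);
}
```

Because the flags accumulate until they are cleared, code that wants to know
whether one particular operation raised an exception has to clear the flag
of interest first, as `demo_operation_was_inexact' does.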

-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

   int32_to_float32      int64_to_float32
   int32_to_float64      int64_to_float64
   int32_to_floatx80     int64_to_floatx80
   int32_to_float128     int64_to_float128

   float32_to_int32      float32_to_int64
   float64_to_int32      float64_to_int64
   floatx80_to_int32     floatx80_to_int64
   float128_to_int32     float128_to_int64

   float32_to_float64    float32_to_floatx80   float32_to_float128
   float64_to_float32    float64_to_floatx80   float64_to_float128
   floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
   float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

   float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
   float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
   floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
   float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
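
The documented results for NaN and out-of-range operands can be sketched
with the native `double' type.  This is not the SoftFloat implementation; it
only mirrors the behavior described above for a double-precision to 32-bit
conversion that rounds toward zero.  The names `demo_int32_result' and
`demo_to_int32_round_to_zero' are hypothetical, and the `invalid' field
stands in for raising the invalid exception.

```c
#include <stdint.h>

typedef struct {
    int32_t value;    /* converted result */
    int     invalid;  /* 1 if the invalid exception would be raised */
} demo_int32_result;

/* Convert x to a 32-bit signed integer, rounding toward zero. */
static demo_int32_result demo_to_int32_round_to_zero(double x)
{
    demo_int32_result r = { 0, 0 };

    if (x != x) {                       /* NaN: largest positive integer */
        r.value = INT32_MAX;
        r.invalid = 1;
    } else if (x >= 2147483648.0) {     /* overflow, positive operand */
        r.value = INT32_MAX;
        r.invalid = 1;
    } else if (x <= -2147483649.0) {    /* overflow, negative operand */
        r.value = INT32_MIN;
        r.invalid = 1;
    } else {
        r.value = (int32_t)x;           /* C casts truncate toward zero */
    }
    return r;
}
```

The range checks come before the cast because converting an out-of-range
`double' to an integer type has undefined behavior in C.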

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

   float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
   float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
   float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard.  The remainder functions are:

   float32_rem
   float64_rem
   floatx80_rem
   float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.
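
The definition above can be illustrated with a small sketch on the native
`double' type.  The names `demo_nearest_even' and `demo_rem' are
hypothetical.  Unlike the SoftFloat functions, this sketch is not exact in
general:  it computes x/y with ordinary rounded division, so it should be
trusted only for small operands whose quotient is far from any rounding
hazard.

```c
/* Integer nearest to q, with ties going to the even integer.  Valid only
   while q fits comfortably in a long long. */
static long long demo_nearest_even(double q)
{
    long long n = (long long)q;          /* truncates toward zero */
    double frac;

    if ((double)n > q) n--;              /* turn truncation into floor */
    frac = q - (double)n;                /* fractional part in [0, 1) */
    if (frac > 0.5 || (frac == 0.5 && (n & 1))) n++;
    return n;
}

/* x - n*y, with n chosen as described in the text. */
static double demo_rem(double x, double y)
{
    return x - (double)demo_nearest_even(x / y) * y;
}
```

Note that the result can be negative even for positive operands:  with x = 7
and y = 2 the quotient 3.5 is a tie, n is the even integer 4, and the
remainder is -1.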

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard.  The functions are:

   float32_round_to_int
   float64_round_to_int
   floatx80_round_to_int
   float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

   float32_eq    float32_le    float32_lt
   float64_eq    float64_le    float64_lt
   floatx80_eq   floatx80_le   floatx80_lt
   float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided.  The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
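
The recipe in the preceding paragraph can be written out as follows.  To
keep the sketch self-contained, `demo_eq', `demo_le', and `demo_lt' are
hypothetical stand-ins built on the native `float' type; in real use they
would be replaced by the corresponding SoftFloat functions operating on
`float32' values.

```c
/* Hypothetical stand-ins for float32_eq, float32_le, and float32_lt. */
static int demo_eq(float a, float b) { return a == b; }
static int demo_le(float a, float b) { return a <= b; }
static int demo_lt(float a, float b) { return a <  b; }

/* The three derived comparisons. */
static int demo_ne(float a, float b) { return !demo_eq(a, b); }
static int demo_ge(float a, float b) { return demo_le(b, a); }
static int demo_gt(float a, float b) { return demo_lt(b, a); }
```

Reversing the operands, rather than negating the result, matters when an
operand is a NaN:  every ordered comparison with a NaN is false, so
`demo_gt' correctly returns 0, whereas the complement of `demo_le' would
return 1.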

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

   float32_eq_signaling    float32_le_quiet    float32_lt_quiet
   float64_eq_signaling    float64_le_quiet    float64_lt_quiet
   floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
   float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

   float32_is_signaling_nan
   float64_is_signaling_nan
   floatx80_is_signaling_nan
   float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
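
A test of this kind can be sketched for the single-precision format.  The
sketch rests on an assumption that this document does not state:  that a NaN
whose most significant fraction bit is 0 is signaling.  That is the common
convention, but some targets use the opposite meaning for that bit, so the
real SoftFloat functions may differ.  The name
`demo_float32_is_signaling_nan' is hypothetical.

```c
#include <stdint.h>

/* Return 1 if the 32-bit pattern a is a signaling NaN, else 0.
   Assumption: the top fraction bit is 0 for signaling NaNs. */
static int demo_float32_is_signaling_nan(uint32_t a)
{
    uint32_t exponent  = (a >> 23) & 0xFF;   /* 8 exponent bits */
    uint32_t fraction  = a & 0x007FFFFF;     /* 23 fraction bits */
    uint32_t quiet_bit = a & 0x00400000;     /* top fraction bit */

    /* A NaN has an all-ones exponent and a nonzero fraction; an all-zero
       fraction with that exponent is an infinity instead. */
    return exponent == 0xFF && fraction != 0 && quiet_bit == 0;
}
```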

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.