#
256281 |
|
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
#
239192 |
|
11-Aug-2012 |
dim |
Change a few extern inline functions in libm to static inline, since they need to refer to static constants, which C99 does not allow for extern inline functions.
While here, change a comment in e_rem_pio2f.c to mention the correct number of bits.
Reviewed by: bde MFC after: 1 week
|
#
193368 |
|
03-Jun-2009 |
ed |
Use ISO C99 style inline semantics in msun.
Because we use ISO C99 nowadays, we can just get rid of enforcing GNU89-style inlining.
|
#
187128 |
|
13-Jan-2009 |
das |
Use __gnu89_inline so that these files will compile with newer versions of gcc, where the meaning of 'inline' was changed to match C99.
Noticed by: rdivacky
|
#
176451 |
|
22-Feb-2008 |
das |
s/rcsid/__FBSDID/
|
#
152881 |
|
28-Nov-2005 |
bde |
Rearranged the polynomial evaluation some more to reduce dependencies. Instead of echoing the code in a comment, try to describe why we split up the evaluation in a special way.
The new optimization is mostly to move the evaluation of w = z*z later so that everything else (except z = x*x) doesn't have to wait for w. On Athlons, FP multiplication has a latency of 4 cycles so this optimization saves 4 cycles per call provided no new dependencies are introduced. Tweaking the other terms in to reduce dependencies saves a couple more cycles in some cases (more on AXP than on A64; up to 8 cycles out of 56 altogether in some cases). The previous version had a similar optimization for s = z*x. Special optimizations like these probably have a larger effect than the simple 2-way vectorization permitted (but not activated by gcc) in the old version, since 2-way vectorization is not enough and the polynomial's degree is so small in the float case that non-vectorizable dependencies dominate.
On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 34-55 cycles (was 39-59 cycles).
|
#
152870 |
|
28-Nov-2005 |
bde |
Changed spelling of the request-to-inline macro name to match the change of the function name.
Added my (non-)copyright.
In k_tanf.c, added the first set of redundant parentheses to control grouping which was claimed to be added in the previous commit.
|
#
152766 |
|
24-Nov-2005 |
bde |
Minor cleanups and optimizations:
- Remove dead code that I forgot to remove in the previous commit.
- Calculate the sum of the lower terms of the polynomial (divided by x**5) in a single expression (sum of odd terms) + (sum of even terms) with parentheses to control grouping. This is clearer and happens to give better instruction scheduling for a tiny optimization (an average of about ~0.5 cycles/call on Athlons).
- Calculate the final sum in a single expression with parentheses to control grouping too. Change the grouping from first_term + (second_term + sum_of_lower_terms) to (first_term + second_term) + sum_of_lower_terms. Normally the first grouping must be used for accuracy, but extra precision makes any grouping give a correct result so we can group for efficiency. This is a larger optimization (average 3-4 cycles/call or 5%).
- Use parentheses to indicate that the C order of left to right evaluation is what is wanted (for efficiency) in a multiplication too.
The old fdlibm code has several optimizations related to these. 2 involve doing an extra operation that can be done almost in parallel on some superscalar machines but are pessimizations on sequential machines. Others involve statement ordering or expression grouping. All of these except the ordering for the combining the sums of the odd and even terms seem to be ideal for Athlons, but parallelism is still limited so all of these optimizations combined together with the ones in this commit save only ~6-8 cycles (~10%).
On an AXP, tanf() on uniformly distributed args in [-2pi, 2pi] now takes 39-59 cycles. I don't know of any more optimizations for tanf() short of writing it all in asm with very MD instruction scheduling. Hardware fsin takes 122-138 cycles. Most of the optimizations for tanf() don't work very well for tan[l](). fdlibm tan() now takes 145-365 cycles.
|
#
152741 |
|
24-Nov-2005 |
bde |
Optimized by eliminating the special case for 0.67434 <= |x| < pi/4.
A single polynomial approximation for tan(x) works in infinite precision up to |x| < pi/2, but in finite precision, to restrict the accumulated roundoff error to < 1 ulp, |x| must be restricted to less than about sqrt(0.5/((1.5+1.5)/3)) ~= 0.707. We restricted it a bit more to give a safety margin including some slop for optimizations. Now that we use double precision for the calculations, the accumulated roundoff error is in double-precision ulps so it can easily be made almost 2**29 times smaller than a single-precision ulp. Near x = pi/4 its maximum is about 0.5+(1.5+1.5)*x**2/3 ~= 1.117 double-precision ulps.
The minimax polynomial needs to be different to work for the larger interval. I didn't increase its degree the old degree is just large enough to keep the final error less than 1 ulp and increasing the degree would be a pessimization. The maximum error is now ~0.80 ulps instead of ~0.53 ulps.
The speedup from this optimization for uniformly distributed args in [-2pi, 2pi] is 28-43% on athlons, depending on how badly gcc selected and scheduled the instructions in the old version. The old version has some int-to-float conversions that are apparently difficult to schedule well, but gcc-3.3 somehow did everything ~10 cycles or ~10% faster than gcc-3.4, with the difference especially large on AXPs. On A64s, the problem seems to be related to documented penalties for moving single precision data to undead xmm registers. With this version, the speed is cycles is almost independent of the athlon and gcc version despite the large differences in instruction selection to use the FPU on AXPs and SSE on A64s.
|
#
152713 |
|
23-Nov-2005 |
bde |
Use only double precision for "kernel" tanf (except for returning float). This is a minor interface change. The function is renamed from __kernel_tanf() to __kernel_tandf() so that misues of it will cause link errors and not crashes.
This version is a routine translation with no special optimizations for accuracy or efficiency. It gives an unimportant increase in accuracy, from ~0.9 ulps to 0.5285 ulps. Almost all of the error is from the minimax polynomial (~0.03 ulps and the final rounding step (< 0.5 ulps). It gives strange differences in efficiency in the -5 to +10% range, with -O1 fairly consistently becoming faster and -O2 slower on AXP and A64 with gcc-3.3 and gcc-3.4.
|
#
152647 |
|
21-Nov-2005 |
bde |
Mess up the "kernel" float trig function .c files with ifdefs so that they can be #included in other .c files to give inline functions, and use them to inline the functions in most callers (not in e_lgammaf_r.c). __kernel_tanf() is too large and complicated for gcc to inline very well.
An athlons, this gives a speed increase under favourable pipeline conditions of about 10% overall (larger for AXP, smaller for A64). E.g., on AXP, sinf() on uniformly distributed args in [-2Pi, 2Pi] now takes 30-56 cycles; it used to take 45-61 cycles; hardware fsin takes 65-129.
|
#
152641 |
|
20-Nov-2005 |
bde |
Use double precision to simplify and optimize a long division.
On athlons, this gives a speedup of 10-20% for tanf() on uniformly distributed args in [-2Pi, 2Pi]. (It only directly applies for 43% of the args and gives a 16-20% speedup for these (more for AXP than A64) and this gives an overall speedup of 10-12% which is all that it should; however, it gives an overall speedup of 17-20% with gcc-3.3 on AXP-A64 by mysteriously effected cases where it isn't executed.)
I originally intended to use double precision for all internals of float trig functions and will probably still do this, but benchmarking showed that converting to double precision and back is a pessimization in cases where a simple float precision calculation works, so it may be optimal to switch precisions only when using extra precision is much simpler.
|
#
152343 |
|
12-Nov-2005 |
bde |
Imoproved comments for the minimax polynomial.
Removed an unused variable.
Fixed some wrong comments and some nearby misformatting.
|
#
152284 |
|
10-Nov-2005 |
bde |
As for __kernel_cosf() and __kernel_sinf(), use a fairly optimal minimax polynomial for __kernel_tanf(). The old one was the double-precision polynomial with coefficients truncated to float. Truncation is not a good way to convert minimax polynomials to lower precision. Optimize for efficiency and use the lowest-degree polynomial that gives a relative error of less than 1 ulp. It has degree 13 instead of 27, and happens to be 2.5 times more accurate (in infinite precision) than the old polynomial (the maximum error is 0.017 ulps instead of 0.041 ulps).
Unlike for cosf and sinf, the old accuracy was close to being inadequate -- the polynomial for double precision has a max error of 0.014 ulps and nearly this small an error is needed. The new accuracy is also a bit small, but exhaustive checking shows that even the old accuracy was enough. The increased accuracy reduces the maximum relative error in the final result on amd64 -O1 from 0.9588 ulps to 0.9044 ulps.
|
#
151969 |
|
02-Nov-2005 |
bde |
Moved the optimization for tiny x from __kernel_tan[f](x) to tan[f](x) so that it can be faster for tiny x and avoided for reduced x.
This improves things a little differently than for cosine and sine. We still need to reclassify x in the "kernel" functions, but we get an extra optimization for tiny x, and an overall optimization since tiny reduced x rarely happens. We also get optimizations for space and style. A large block of poorly duplicated code to fix a special case is no longer needed. This supersedes the fixes in k_sin.c revs 1.9 and 1.11 and k_sinf.c 1.8 and 1.10.
Fixed wrong constant for the cutoff for "tiny" in tanf(). It was 2**-28, but should be almost the same as the cutoff in sinf() (2**-12). The incorrect cutoff protected us from the bugs fixed in k_sinf.c 1.8 and 1.10, except 4 cases of reduced args passed the cutoff and needed special handling in theory although not in practice. Now we essentially use a cutoff of 0 for the case of reduced args, so we now have 0 special args instead of 4.
This change makes no difference to the results for sinf() (since it only changes the algorithm for the 4 special args and the results for those happen not to change), but it changes lots of results for sin(). Exhaustive testing is impossible for sin(), but exhaustive testing for sinf() (relative to a version with the old algorithm and a fixed cutoff) shows that the changes in the error are either reductions or from 0.5-epsilon ulps to 0.5+epsilon ulps. The new method just uses some extra terms in approximations so it tends to give more accurate results, and there are apparently no problems from having extra accuracy. On amd64 with -O1, on all float args the error range in ulps is reduced from (0.500, 0.665] to [0.335, 0.500) in 24168 cases and increased from 0.500-epsilon to 0.500+epsilon in 24 cases. Non- exhaustive testing by ucbtest shows no differences.
|
#
151963 |
|
02-Nov-2005 |
bde |
Removed dead code for handling tan[f]() on odd multiples of pi/2. This case never occurs since pi/2 is irrational so no multiple of it can be represented as a float and we have precise arg reduction so we never end up with a remainder of 0 in the "kernel" function unless the original arg is 0.
If this case occurs, then we would now fall through to general code that returns +-Inf (depending on the sign of the reduced arg) instead of forcing +Inf. The correct handling would be to return NaN since we would have lost so much precision that the correct result can be anything _except_ +-Inf.
Don't reindent the else clause left over from this, although it was already bogusly indented ("if (foo) return; else ..." just marches the indentation to the right), since it will be removed too.
Index: k_tan.c =================================================================== RCS file: /home/ncvs/src/lib/msun/src/k_tan.c,v retrieving revision 1.10 diff -r1.10 k_tan.c 88,90c88 < if (((ix | low) | (iy + 1)) == 0) < return one / fabs(x); < else { --- > {
|
#
151961 |
|
02-Nov-2005 |
bde |
Fixed some of the silliness related to rev.1.8. In 1.8, "double" in a declaration was not translated to "float" although bit fiddling on double variables was translated. This resulted in garbage being put into the low word of one of the doubles instead of non-garbage being put into the only word of the intended float. This had no effect on any result because: - with doubles, the algorithm for calculating -1/(x+y) is unnecessarily complicated. Just returning -1/((double)x+y) would work, and the misdeclaration gave something like that except for messing up some low bits with the bit fiddling. - doubles have plenty of bits to spare so messing up some of the low bits is unlikely to matter. - due to other bugs, the buggy code is reached for a whole 4 args out of all 2**32 float args. The bug fixed by 1.8 only affects a small percentage of cases and a small percentage of 4 is 0. The 4 args happen to cause no problems without 1.8, so they are even less likely to be affected by the bug in 1.8 than average args; in fact, neither 1.8 nor this commit makes any difference to the result for these 4 args (and thus for all args).
Corrections to the log message in 1.8: the bug only applies to tan() and not tanf(), not because the float type can't represent numbers large enough to trigger the problem (e.g., the example in the fdlibm-5.3 readme which is > 1.0e269), but because: - the float type can't represent small enough numbers. For there to be a possible problem, the original arg for tanf() must lie very near an odd multiple of pi/2. Doubles can get nearer in absolute units. In ulps there should be little difference, but ... - ... the cutoff for "small" numbers is bogus in k_tanf.c. It is still the double value (2**-28). Since this is 32 times smaller than FLT_EPSILON and large float values are not very uniformly distributed, only 6 args other than ones that are initially below the cutoff give a reduced arg that passes the cutoff (the 4 problem cases mentioned above and 2 non-problem cases).
Fixing the cutoff makes the bug affect tanf() and much easier to detect than for tan(). With a cutoff of 2**-12 on amd64 with -O1, 670102 args pass the cutoff; of these, there are 337604 cases where there might be an error of >= 1 ulp and 5826 cases where there is such an error; the maximum error is 1.5382 ulps.
The fix in 1.8 works with the reduced cutoff in all cases despite the bug in it. It changes the result in 84492 cases altogether to fix the 5826 broken cases. Fixing the fix by translating "double" to "float" changes the result in 42 cases relative to 1.8. In 24 cases the (absolute) error is increased and in 18 cases it is reduced, but it remains less than 1 ulp in all cases.
|
#
129981 |
|
02-Jun-2004 |
das |
Port a bugfix from FDLIBM 5.3. The bug really only applies to tan() and not tanf() because float type can't represent numbers large enough to trigger the problem. However, there seems to be a precedent that the float versions of the fdlibm routines should mirror their double counterparts.
Also update to the FDLIBM 5.3 license.
Obtained from: FDLIBM Reviewed by: exhaustive comparison
|
#
97413 |
|
28-May-2002 |
alfred |
Fix formatting, this is hard to explain, so I'll show one example.
- float ynf(int n, float x) /* wrapper ynf */ +float +ynf(int n, float x) /* wrapper ynf */
This is because the __STDC__ stuff was indented.
Reviewed by: md5
|
#
97409 |
|
28-May-2002 |
alfred |
Assume __STDC__, remove non-__STDC__ code.
Reviewed by: md5
|
#
50476 |
|
27-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
#
22993 |
|
22-Feb-1997 |
peter |
Revert $FreeBSD$ to $Id$
|
#
21673 |
|
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
8870 |
|
30-May-1995 |
rgrimes |
Remove trailing whitespace.
|
#
2117 |
|
19-Aug-1994 |
jkh |
This commit was generated by cvs2svn to compensate for changes in r2116, which included commits to RCS files with non-trunk default branches.
|
#
2116 |
|
19-Aug-1994 |
jkh |
J.T. Conklin's latest version of the Sun math library.
-- Begin comments from J.T. Conklin: The most significant improvement is the addition of "float" versions of the math functions that take float arguments, return floats, and do all operations in floating point. This doesn't help (performance) much on the i386, but they are still nice to have.
The float versions were orginally done by Cygnus' Ian Taylor when fdlibm was integrated into the libm we support for embedded systems. I gave Ian a copy of my libm as a starting point since I had already fixed a lot of bugs & problems in Sun's original code. After he was done, I cleaned it up a bit and integrated the changes back into my libm. -- End comments
Reviewed by: jkh Submitted by: jtc
|