about summary refs log tree commit diff
path: root/src/math
Commit message (Collapse)AuthorAgeFilesLines
* add riscv64 architecture supportRich Felker2019-06-1412-0/+180
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Author: Alex Suykov <alex.suykov@gmail.com> Author: Aric Belsito <lluixhi@gmail.com> Author: Drew DeVault <sir@cmpwn.com> Author: Michael Clark <mjc@sifive.com> Author: Michael Forney <mforney@mforney.org> Author: Stefan O'Rear <sorear2@gmail.com> This port has involved the work of many people over several years. I have tried to ensure that everyone with substantial contributions has been credited above; if any omissions are found they will be noted later in an update to the authors/contributors list in the COPYRIGHT file. The version committed here comes from the riscv/riscv-musl repo's commit 3fe7e2c75df78eef42dcdc352a55757729f451e2, with minor changes by me for issues found during final review: - a_ll/a_sc atomics are removed (according to the ISA spec, lr/sc are not safe to use in separate inline asm fragments) - a_cas[_p] is fixed to be a memory barrier - the call from the _start assembly into the C part of crt1/ldso is changed to allow for the possibility that the linker does not place them nearby each other. - DTP_OFFSET is defined correctly so that local-dynamic TLS works - reloc.h LDSO_ARCH logic is simplified and made explicit. - unused, non-functional crti/n asm files are removed. - an empty .sdata section is added to crt1 so that the __global_pointer reference is resolvable. - indentation style errors in some asm files are fixed.
* math: new powSzabolcs Nagy2019-04-173-303/+520
| | | | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc The underflow exception is signaled if the result is in the subnormal range even if the result is exact. code size change: +3421 bytes. benchmark on x86_64 before, after, speedup: -Os: pow rthruput: 102.96 ns/call 33.38 ns/call 3.08x pow latency: 144.37 ns/call 54.75 ns/call 2.64x -O3: pow rthruput: 98.91 ns/call 32.79 ns/call 3.02x pow latency: 138.74 ns/call 53.78 ns/call 2.58x
* math: new exp and exp2Szabolcs Nagy2019-04-174-480/+434
| | | | | | | | | | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc TOINT_INTRINSICS and EXP_USE_TOINT_NARROW cases are unused. The underflow exception is signaled if the result is in the subnormal range even if the result is exact (e.g. exp2(-1023.0)). code size change: -1672 bytes. benchmark on x86_64 before, after, speedup: -Os: exp rthruput: 12.73 ns/call 6.68 ns/call 1.91x exp latency: 45.78 ns/call 21.79 ns/call 2.1x exp2 rthruput: 6.35 ns/call 5.26 ns/call 1.21x exp2 latency: 26.00 ns/call 16.58 ns/call 1.57x -O3: exp rthruput: 12.75 ns/call 6.73 ns/call 1.89x exp latency: 45.91 ns/call 21.80 ns/call 2.11x exp2 rthruput: 6.47 ns/call 5.40 ns/call 1.2x exp2 latency: 26.03 ns/call 16.54 ns/call 1.57x
* math: new log2Szabolcs Nagy2019-04-173-106/+335
| | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc code size change: +2458 bytes (+1524 bytes with fma). benchmark on x86_64 before, after, speedup: -Os: log2 rthruput: 16.08 ns/call 10.49 ns/call 1.53x log2 latency: 44.54 ns/call 25.55 ns/call 1.74x -O3: log2 rthruput: 15.92 ns/call 10.11 ns/call 1.58x log2 latency: 44.66 ns/call 26.16 ns/call 1.71x
* math: new logSzabolcs Nagy2019-04-173-104/+454
| | | | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single instruction. code size change: +4588 bytes (+2540 bytes with fma). benchmark on x86_64 before, after, speedup: -Os: log rthruput: 12.61 ns/call 7.95 ns/call 1.59x log latency: 41.64 ns/call 23.38 ns/call 1.78x -O3: log rthruput: 12.51 ns/call 7.75 ns/call 1.61x log latency: 41.82 ns/call 23.55 ns/call 1.78x
* math: new powfSzabolcs Nagy2019-04-173-240/+226
| | | | | | | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc POWF_SCALE != 1.0 case only matters if TOINT_INTRINSICS is set, which is currently not supported for any target. SNaN is not supported, it would require an issignalingf implementation. code size change: -816 bytes. benchmark on x86_64 before, after, speedup: -Os: powf rthruput: 95.14 ns/call 20.04 ns/call 4.75x powf latency: 137.00 ns/call 34.98 ns/call 3.92x -O3: powf rthruput: 92.48 ns/call 13.67 ns/call 6.77x powf latency: 131.11 ns/call 35.15 ns/call 3.73x
* math: new exp2f and expfSzabolcs Nagy2019-04-174-179/+177
| | | | | | | | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc In expf TOINT_INTRINSICS is kept, but is unused, it would require support for __builtin_round and __builtin_lround as single instruction. code size change: +94 bytes. benchmark on x86_64 before, after, speedup: -Os: expf rthruput: 9.19 ns/call 8.11 ns/call 1.13x expf latency: 34.19 ns/call 18.77 ns/call 1.82x exp2f rthruput: 5.59 ns/call 6.52 ns/call 0.86x exp2f latency: 17.93 ns/call 16.70 ns/call 1.07x -O3: expf rthruput: 9.12 ns/call 4.92 ns/call 1.85x expf latency: 34.44 ns/call 18.99 ns/call 1.81x exp2f rthruput: 5.58 ns/call 4.49 ns/call 1.24x exp2f latency: 17.95 ns/call 16.94 ns/call 1.06x
* math: new log2fSzabolcs Nagy2019-04-173-58/+108
| | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc code size change: +177 bytes. benchmark on x86_64 before, after, speedup: -Os: log2f rthruput: 11.38 ns/call 5.99 ns/call 1.9x log2f latency: 35.01 ns/call 22.57 ns/call 1.55x -O3: log2f rthruput: 10.82 ns/call 5.58 ns/call 1.94x log2f latency: 35.13 ns/call 21.04 ns/call 1.67x
* math: new logfSzabolcs Nagy2019-04-173-54/+109
| | | | | | | | | | | | | | | | from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc, with minor changes to better fit into musl. code size change: +289 bytes. benchmark on x86_64 before, after, speedup: -Os: logf rthruput: 8.40 ns/call 6.14 ns/call 1.37x logf latency: 31.79 ns/call 24.33 ns/call 1.31x -O3: logf rthruput: 8.43 ns/call 5.58 ns/call 1.51x logf latency: 32.04 ns/call 20.88 ns/call 1.53x
* math: add double precision error handling functionsSzabolcs Nagy2019-04-175-0/+30
|
* math: add single precision error handling functionsSzabolcs Nagy2019-04-175-0/+30
| | | | | | | | | | These are supposed to be used in tail call positions when handling special cases in new code. (fp exceptions may be raised "naturally" by the common code path if special casing is more effort.) This implements the error handling apis used in https://github.com/ARM-software/optimized-routines without errno setting.
* fix unintended global symbols in atanl.cDan Gohman2019-04-031-3/+3
| | | | | | | | Mark atanhi, atanlo, and aT in atanl.c as static, as they're not intended to be part of the public API. These are already static in the LDBL_MANT_DIG == 64 code, so this patch is just making the LDBL_MANT_DIG == 113 code do the same thing.
* x86_64: add single instruction fmaSzabolcs Nagy2018-10-154-0/+92
| | | | | | | fma is only available on recent x86_64 cpus and it is much faster than a software fma, so this should be done with a runtime check, however that requires more changes, this patch just adds the code so it can be tested when musl is compiled with -mfma or -mfma4.
* arm: add single instruction fmaSzabolcs Nagy2018-10-152-0/+30
| | | | | | | | | | | | | | vfma is available in the vfpv4 fpu and above, the ACLE standard feature test for double precision hardware fma support is __ARM_FEATURE_FMA && __ARM_FP&8 we need further checks to work around clang bugs (fixed in clang >=7.0) && !__SOFTFP__ because __ARM_FP is defined even with -mfloat-abi=soft && !BROKEN_VFP_ASM to disable the single precision code when inline asm handling is broken. For runtime selection the HWCAP_ARM_VFPv4 hwcap flag can be used, but that requires further work.
* powerpc: add single instruction fabs, fabsf, fma, fmaf, sqrt, sqrtfSzabolcs Nagy2018-10-156-0/+90
| | | | | These are only available on hard float target and sqrt is not available in the base ISA, so further check is used.
* s390x: add single instruction fma and fmafSzabolcs Nagy2018-10-152-0/+14
| | | | These are available in the s390x baseline isa -march=z900.
* reduce spurious inclusion of libc.hRich Felker2018-09-129-9/+0
| | | | | | | | | | | | | | | | | | | | | libc.h was intended to be a header for access to global libc state and related interfaces, but ended up included all over the place because it was the way to get the weak_alias macro. most of the inclusions removed here are places where weak_alias was needed. a few were recently introduced for hidden. some go all the way back to when libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented) cancellation points had to include it. remaining spurious users are mostly callers of the LOCK/UNLOCK macros and files that use the LFS64 macro to define the awful *64 aliases. in a few places, new inclusion of libc.h is added because several internal headers no longer implicitly include libc.h. declarations for __lockfile and __unlockfile are moved from libc.h to stdio_impl.h so that the latter does not need libc.h. putting them in libc.h made no sense at all, since the macros in stdio_impl.h are needed to use them correctly anyway.
* apply hidden visibility to various remaining internal interfacesRich Felker2018-09-121-0/+2
|
* apply hidden visibility to internal math functionsRich Felker2018-09-121-2/+2
| | | | | | | | this makes significant differences to codegen on archs with an expensive PLT-calling ABI; on i386 and gcc 7.3 for example, the sin and sinf functions no longer touch call-saved registers or the stack except for pushing outgoing arguments. performance is likely improved too, but no measurements were taken.
* move lgamma-related internal declarations to libm.hRich Felker2018-09-124-12/+3
|
* add support for m68k 80-bit long double variantRich Felker2018-06-141-2/+10
| | | | | | | | | since x86 and m68k are the only archs with 80-bit long double and each has mandatory endianness, select the variant via endianness. differences are minor: apparently just byte order and representation of infinities. the m68k format is not well-documented anywhere I could find, so if other differences are found they may require additional changes later.
* fix fmaf wrong resultSzabolcs Nagy2018-04-021-1/+1
| | | | | | | | | | | | | if double precision r=x*y+z is not a half way case between two single precision floats or it is an exact result then fmaf returns (float)r. however the exactness check was wrong when |x*y| < |z| and could cause incorrectly rounded result in nearest rounding mode when r is a half way case. fmaf(-0x1.26524ep-54, -0x1.cb7868p+11, 0x1.d10f5ep-29) was incorrectly rounded up to 0x1.d117ap-29 instead of 0x1.d1179ep-29. (exact result is 0x1.d1179efffffffecp-29, r is 0x1.d1179fp-29)
* math: rewrite fma with mostly int arithmeticsSzabolcs Nagy2017-10-131-431/+154
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the freebsd fma code failed to raise underflow exception in some cases in nearest rounding mode (affects fmal too) e.g. fma(-0x1p-1000, 0x1.000001p-74, 0x1p-1022) and the inexact exception may be raised spuriously since the fenv is not saved/restored around the exact multiplication algorithm (affects x86 fma too). another issue is that the underflow behaviour when the rounded result is the minimal normal number is target dependent, ieee754 allows two ways to raise underflow for inexact results: raise if the result before rounding is in the subnormal range (e.g. aarch64, arm, powerpc) or if the result after rounding with infinite exponent range is in the subnormal range (e.g. x86, mips, sh). to avoid all these issues the algorithm was rewritten with mostly int arithmetics and float arithmetics is only used to get correct rounding and raise exceptions according to the behaviour of the target without any fenv.h dependency. it also unifies x86 and non-x86 fma. fmaf is not affected, fmal need to be fixed too. this algorithm depends on a_clz_64 and it required a few spurious instructions to make sure underflow exception is raised in a particular corner case. (normally FORCE_EVAL(tiny*tiny) would be used for this, but on i386 gcc is broken if the expression is constant https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57245 and there is no easy portable fix for the macro.)
* powerpc64: add single-instruction math functionsRich Felker2017-06-2322-0/+290
| | | | | | | | | | while the official elfv2 abi for "powerpc64le" sets power8 as the baseline isa, we use it for both little and big endian powerpc64 targets and need to maintain compatibility with pre-power8 models. the instructions for sqrt, fabs, and fma are in the baseline isa; support for the rest is conditional via predefined isa-level macros. patch by David Edelsohn.
* s390x: add single-instruction math functionsRich Felker2017-06-2324-0/+360
| | | | | | | | these were introduced in z196 and not available in the baseline (z900) ISA level. use __HTM__ as an alternate indicator for ISA level, since gcc did not define __ARCH__ until 7.x. patch by David Edelsohn.
* fix scalbn when result is in the subnormal rangeSzabolcs Nagy2017-04-213-12/+14
| | | | | | | | | | | | | | | | in nearest rounding mode scalbn could introduce double rounding error when an intermediate value and the final result were both in the subnormal range e.g. scalbn(0x1.7ffffffffffffp-1, -1073) returned 0x1p-1073 instead of 0x1p-1074, because the intermediate computation got rounded to 0x1.8p-1023. with the fix an intermediate value can only be in the subnormal range if the final result is 0 which is correct even after double rounding. (there still can be two roundings so signals may be raised twice, but that's only observable with trapping exceptions which is not supported.)
* aarch64: add single instruction math functionsSzabolcs Nagy2017-03-2134-24/+226
| | | | | | | | | | | | | | | | | | | | | | | | this should increase performance and reduce code size on aarch64. the compiled code was checked against using __builtin_* instead of inline asm with gcc-6.2.0. lrint is two instructions. c with inline asm is used because it is safer than a pure asm implementation, this prevents ll{rint,round} to be an alias of l{rint,round} (because the types don't match) and depends on gcc style inline asm support. ceil, floor, round, trunc can either raise inexact on finite non-integer inputs or not raise any exceptions. the new implementation does not raise exceptions while the generic c code does. on aarch64, the underflow exception is signaled before rounding (ieee 754 allows both before and after rounding, but it must be consistent), the generic fma c code signals it after rounding so using single instruction fixes a slight conformance issue too.
* fix threshold constants in j0f, y0f, j1f, y1fSzabolcs Nagy2017-03-152-13/+12
| | | | | | | | | | | | | partly following freebsd rev 279491 https://svnweb.freebsd.org/base?view=revision&revision=279491 (musl had some of the fixes before freebsd). the change should not matter much for j0f, y0f, but it improves j1f and y1f in [2.5,~3.75] (that is [0x40200000,~0x40700000]). near roots (e.g. around 3.8317 for j1f) there are still large ulp errors. dropped code that tried to raise inexact.
* math: fix pow signed shift ubSzabolcs Nagy2016-10-201-2/+2
| | | | | | | | | | | | j is int32_t and thus j<<31 is undefined if j==1, so j is changed to uint32_t locally as a quick fix, the generated code is not affected. (this is a strict conformance fix, future c standard may allow 1<<31, see DR 463. the bug was inherited from freebsd fdlibm, the proper fix is to use uint32_t for all bit hacks, but that requires more intrusive changes.) reported by Daniel Sabogal
* math: fix 128bit long double inverse trigonometric functionsSzabolcs Nagy2016-08-301-1/+1
| | | | | | | | | there was a copy paste error that could cause large ulp errors in atan2l, atanl, asinl and acosl on aarch64, mips64 and mipsn32. (the implementation is from freebsd fdlibm, but the tail end of the polynomial was wrong. 128 bit long double functions are not yet tested so this went undetected.)
* math: fix expf(-NAN) and exp2f(-NAN) to return -NAN instead of 0Szabolcs Nagy2016-03-042-0/+4
| | | | | | expf(-NAN) was treated as expf(-large) which unconditionally returns +0, so special case +-NAN. reported by Petr Hosek.
* work around regression building for armhf with clang (compiler bug)Rich Felker2016-02-192-2/+2
| | | | | | | | | commit e4355bd6bec89688e8c739cd7b4c76e675643dca moved the math asm from external source files to inline asm, but unfortunately, all current releases of clang use the wrong inline asm constraint codes for float and double ("w" and "P" instead of "t" and "w", respectively). this patch adds detection for the bug in configure, and, for now, just disables the affected asm on broken clang versions.
* improve macro logic for enabling arm math asmRich Felker2016-02-182-2/+2
| | | | | | | | | in order to take advantage of the fpu in -mfloat-abi=softfp mode, the __VFP_FP__ (presence of vfp fpu) was checked instead of checking for __ARM_PCS_VFP (hardfloat EABI variant). however, the latter macro is the one that's actually specified by the ABI documents rather than being compiler-specific, and should also be checked in case __VFP_FP__ is not defined on some compilers or some configurations.
* replace armhf math asm source files with inline asmRich Felker2016-01-2016-40/+60
| | | | | | | | | | | | this makes it possible to inline them with LTO, and is the simplest approach to eliminating the use of .sub files. this also makes VFP sqrt available for use with the standard EABI (plain arm rather than armhf subarch) when libc is built with -mfloat-abi=softfp. the same could have been done for fabs, but when the argument and return value are in integer registers, moving to VFP registers and back is almost certainly more costly than a simple integer operation.
* math: explicitly promote expressions to excess-precision typesRich Felker2015-11-213-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | a conforming compiler for an arch with excess precision floating point (FLT_EVAL_METHOD!=0; presently i386 is the only such arch supported) computes all intermediate results in the types float_t and double_t rather than the nominal type of the expression. some incorrect compilers, however, only keep excess precision in registers, and convert down to the nominal type when spilling intermediate results to memory, yielding unpredictable results that depend on the compiler's choices of what/when to spill. in particular, this happens on old gcc versions with -ffloat-store, which we need in order to work around bugs where the compiler wrongly keeps explicitly-dropped excess precision. by explicitly converting to double_t where expressions are expected be be evaluated in double_t precision, we can avoid depending on the compiler to get types correct when spilling; the nominal and intermediate precision now match. this commit should not change the code generated by correct compilers, or by old ones on non-i386 archs where double_t is defined as double. this fixes a serious bug in argument reduction observed on i386 with gcc 4.2: for values of x outside the unit circle, sin(x) was producing results outside the interval [-1,1]. changes made in commit 0ce946cf808274c2d6e5419b139e130c8ad4bd30 were likely responsible for breaking compatibility with this and other old gcc versions. patch by Szabolcs Nagy.
* explicitly assemble all arm asm sources as UALRich Felker2015-11-104-0/+4
| | | | | | | | these files are all accepted as legacy arm syntax when producing arm code, but legacy syntax cannot be used for producing thumb2 with access to the full ISA. even after switching to UAL, some asm source files contain instructions which are not valid in thumb mode, so these will need to be addressed separately.
* declare fpu usage to the assembler in arm hard-float asm filesSzabolcs Nagy2015-10-194-0/+4
| | | | | | | Some armhf gcc toolchains (built with --with-float=hard but without --with-fpu=vfp*) do not pass -mfpu=vfp to the assembler and then binutils rejects the UAL mnemonics for VFP unless there is an .fpu vfp directive in the asm source.
* fix regression in x86_64 math asm with old binutilsRich Felker2015-04-232-6/+6
| | | | | | | | | the implicit-operand form of fucomip is rejected by binutils 2.19 and perhaps other versions still in use. writing both operands explicitly fixes the issue. there is no change to the resulting output. commit a732e80d33b4fd6f510f7cec4f5573ef5d89bc4e was the source of this regression.
* remove potentially PIC-incompatible relocations from x86_64 and x32 asmRich Felker2015-04-182-2/+2
| | | | analogous to commit 8ed66ecbcba1dd0f899f22b534aac92a282f42d5 for i386.
* remove the last of possible-textrels from i386 asmRich Felker2015-04-182-1/+5
| | | | | | | | | | | | none of these are actual textrels because of ld-time binding performed by -Bsymbolic-functions, but I'm changing them with the goal of making ld-time binding purely an optimization rather than relying on it for semantic purposes. in the case of memmove's call to memcpy, making it explicit that the memmove asm is assuming the forward-copying behavior of the memcpy asm is desirable anyway; in case memcpy is ever changed, the semantic mismatch would be apparent while editing memmcpy.s.
* math: fix pow(+-0,-inf) not to raise divbyzero flagSzabolcs Nagy2015-04-183-3/+3
| | | | | | this reverts the commit f29fea00b5bc72d4b8abccba2bb1e312684d1fce which was based on a bug in C99 and POSIX and did not match IEEE-754 http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1515.pdf
* add aarch64 portSzabolcs Nagy2015-03-114-0/+24
| | | | | | | | | | This adds complete aarch64 target support including bigendian subarch. Some of the long double math functions are known to be broken otherwise interfaces should be fully functional, but at this point consider this port experimental. Initial work on this port was done by Sireesh Tripurari and Kevin Bortis.
* math: add dummy implementations of 128 bit long double functionsSzabolcs Nagy2015-03-1116-4/+97
| | | | | | | | This is in preparation for the aarch64 port only to have the long double math symbols available on ld128 platforms. The implementations should be fixed up later once we have proper tests for these functions. Added bigendian handling for ld128 bit manipulations too.
* math: add ld128 exp2l based on the freebsd implementationSzabolcs Nagy2015-03-111-1/+366
| | | | | Changed the special case handling and bit manipulation to better match the double version.
* math: fix fmodl for IEEE binary128Szabolcs Nagy2015-02-091-1/+1
| | | | | This trivial copy-paste bug went unnoticed due to lack of testing. No currently supported target archs are affected.
* math: fix __fpclassifyl(-0.0) for IEEE binary128Szabolcs Nagy2015-02-081-3/+2
| | | | | The sign bit was not cleared before checking for 0 so -0.0 was misclassified as FP_SUBNORMAL instead of FP_ZERO.
* add parenthesis in fma.c to clarify intent and silence warningsSzabolcs Nagy2015-02-081-1/+1
|
* math: use fnstsw consistently instead of fstsw in x87 asmSzabolcs Nagy2014-11-0511-11/+11
| | | | | | fnstsw does not wait for pending unmasked x87 floating-point exceptions and it is the same as fstsw when all exceptions are masked which is the only environment libc supports.
* math: fix x86_64 and x32 asm not to use sahf instructionSzabolcs Nagy2014-11-056-28/+14
| | | | | | | | | | | Some early x86_64 cpus (released before 2006) did not support sahf/lahf instructions so they should be avoided (intel manual says they are only supported if CPUID.80000001H:ECX.LAHF-SAHF[bit 0] = 1). The workaround simplifies exp2l and expm1l because fucomip can be used instead of the fucomp;fnstsw;sahf sequence copied from i386. In fmodl and remainderl sahf is replaced by a simple bit test.
* math: use the rounding idiom consistentlySzabolcs Nagy2014-10-3113-58/+89
| | | | | | | | | | | | | | | | | | | | the idiomatic rounding of x is n = x + toint - toint; where toint is either 1/EPSILON (x is non-negative) or 1.5/EPSILON (x may be negative and nearest rounding mode is assumed) and EPSILON is according to the evaluation precision (the type of toint is not very important, because single precision float can represent the 1/EPSILON of ieee binary128). in case of FLT_EVAL_METHOD!=0 this avoids a useless store to double or float precision, and the long double code became cleaner with 1/LDBL_EPSILON instead of ifdefs for toint. __rem_pio2f and __rem_pio2 functions slightly changed semantics: on i386 a double-rounding is avoided so close to half-way cases may get evaluated differently eg. as sin(pi/4-eps) instead of cos(pi/4+eps)