path: root/src/internal

Commit log for src/internal (subject, author, date; files changed, lines -removed/+added):
* lift locale lock out of internal __get_locale (Rich Felker, 2020-12-09; 1 file, -0/+2)

  this allows the lock to be shared with setlocale, eliminates repeated per-category lock/unlock in newlocale, and will allow the use of pthread_once in newlocale to be dropped (to be done separately).
* fix regression in pthread_exit (Rich Felker, 2020-11-20; 1 file, -1/+2)

  commit d26e0774a59bb7245b205bc8e7d8b35cc2037095 moved the detach state transition at exit before the thread list lock was taken. this inadvertently allowed pthread_join to race to take the thread list lock first, and proceed with unmapping of the exiting thread's memory.

  we could fix this by just reverting the offending commit and instead performing __vm_wait unconditionally before taking the thread list lock, but that may be costly. instead, bring back the old DT_EXITING vs DT_EXITED state distinction that was removed in commit 8f11e6127fe93093f81a52b15bb1537edc3fc8af, and don't transition to DT_EXITED (a value of 0, which is what pthread_join waits for) until after the lock has been taken.
* lift child restrictions after multi-threaded fork (Rich Felker, 2020-11-11; 1 file, -0/+19)

  as the outcome of Austin Group tracker issue #62, future editions of POSIX have dropped the requirement that fork be AS-safe. this allows but does not require implementations to synchronize fork with internal locks and give forked children of multithreaded parents a partly or fully unrestricted execution environment where they can continue to use the standard library (per POSIX, they can only portably use AS-safe functions).

  up until recently, taking this allowance did not seem desirable. however, commit 8ed2bd8bfcb4ea6448afb55a941f4b5b2b0398c0 exposed the extent to which applications and libraries are depending on the ability to use malloc and other non-AS-safe interfaces in MT-forked children, by converting latent very-low-probability catastrophic state corruption into predictable deadlock. dealing with the fallout has been a huge burden for users/distros.

  while it looks like most of the non-portable usage in applications could be fixed given sufficient effort, at least some of it seems to occur in language runtimes which are exposing the ability to run unrestricted code in the child as part of the contract with the programmer. any attempt at fixing such contracts is not just a technical problem but a social one, and is probably not tractable.

  this patch extends the fork function to take locks for all libc singletons in the parent, and release or reset those locks in the child, so that when the underlying fork operation takes place, the state protected by these locks is consistent and ready for the child to use. locking is skipped in the case where the parent is single-threaded so as not to interfere with the legacy AS-safety property of fork in single-threaded programs. lock order is mostly arbitrary, but the malloc locks (including the bump allocator in case it's used) must be taken after the locks on any subsystems that might use malloc, and non-AS-safe locks cannot be taken while the thread list lock is held, imposing a requirement that it be taken last.
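  a minimal sketch of the ordering constraint, with stand-in lock names (not the actual objects or helpers in musl):

      /* illustrative stand-in; the real acquisition is musl's atomic lock */
      static void take(volatile int *l) { while (__sync_lock_test_and_set(l, 1)) ; }

      static volatile int atfork_lock, stdio_lock, malloc_lock, thread_list_lock;

      static void fork_parent_lock_order_sketch(void)
      {
          take(&atfork_lock);       /* subsystems that might call malloc ... */
          take(&stdio_lock);
          take(&malloc_lock);       /* malloc after any of its potential callers */
          take(&thread_list_lock);  /* last: no non-AS-safe lock after this */
      }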
* move aio implementation details to a proper internal header (Rich Felker, 2020-10-14; 2 files, -3/+9)

  also fix the lack of declaration (and thus hidden visibility) in __stdio_close's use of __aio_close.
* remove long-unused struct __timer from pthread_impl.h (Rich Felker, 2020-10-14; 1 file, -5/+0)

  commit 3990c5c6a40440cdb14746ac080d0ecf8d5d6733 removed the last reference.
* move __abort_lock to its own file and drop pointless weak_alias trick (Rich Felker, 2020-10-14; 1 file, -0/+2)

  the dummy definition of __abort_lock in sigaction.c was performing exactly the same role that putting the lock in its own source file could and should have been used to achieve. while we're moving it, give it a proper declaration.
* fix fork of processes with active async io contexts (Rich Felker, 2020-09-28; 1 file, -0/+2)

  previously, if a file descriptor had aio operations pending in the parent before fork, attempting to close it in the child would attempt to cancel a thread belonging to the parent. this could deadlock, fail, or crash the whole process if the cancellation signal handler was not yet installed in the parent. in addition, further use of aio from the child could malfunction or deadlock.

  POSIX specifies that async io operations are not inherited by the child on fork, so clear the entire aio fd map in the child, and take the aio map lock (with signals blocked) across the fork so that the lock is kept in a consistent state.
* remove redundant pthread struct members repeated for layout purposes (Rich Felker, 2020-08-27; 1 file, -9/+14)

  dtv_copy, canary2, and canary_at_end existed solely to match multiple ABI and asm-accessed layouts simultaneously. now that pthread_arch.h can be included before struct __pthread is defined, the struct layout can depend on macros defined by pthread_arch.h.
* deduplicate __pthread_self thread pointer adjustment out of each arch (Rich Felker, 2020-08-27; 1 file, -0/+2)

  the adjustment made is entirely a function of TLS_ABOVE_TP and TP_OFFSET. aside from avoiding repetition of the TP_OFFSET value and arithmetic, this change makes pthread_arch.h independent of the definition of struct __pthread from pthread_impl.h. this in turn will allow inclusion of pthread_arch.h to be moved to the top of pthread_impl.h so that it can influence the definition of the structure.

  previously, arch files were very inconsistent about the type used for the thread pointer. this change unifies the new __get_tp interface to always use uintptr_t, which is the most correct when performing arithmetic that may involve addresses outside the actual pointed-to object (due to TP_OFFSET).
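  the resulting shape, as a sketch (assuming the macro names from the message; struct layout details elided):

      #include <stdint.h>

      /* __get_tp() is the per-arch primitive, now uniformly uintptr_t */
      #ifndef TP_OFFSET
      #define TP_OFFSET 0
      #endif

      #ifdef TLS_ABOVE_TP
      #define __pthread_self() \
          ((pthread_t)(__get_tp() - sizeof(struct __pthread) - TP_OFFSET))
      #else
      #define __pthread_self() ((pthread_t)__get_tp())
      #endif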
* deduplicate TP_ADJ logic out of each arch, replace with TP_OFFSET (Rich Felker, 2020-08-24; 1 file, -0/+10)

  the only part of TP_ADJ that was not uniquely determined by TLS_ABOVE_TP was the 0x7000 adjustment used mainly on mips and powerpc variants.
* make h_errno thread-local (Rich Felker, 2020-08-24; 1 file, -0/+1)

  the framework to do this always existed but it was deemed unnecessary because the only [ex-]standard functions using h_errno were not thread-safe anyway. however, some of the nonstandard res_* functions are also supposed to set h_errno to indicate the cause of error, and were unable to do so because it was not thread-safe. this change is a prerequisite for fixing them.
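  a sketch of the usual arrangement for a thread-local h_errno (the member name in struct __pthread is an assumption here, not taken from the message):

      /* public header exposes the name as a macro over a location function */
      #define h_errno (*__h_errno_location())

      int *__h_errno_location(void)
      {
          return &__pthread_self()->h_errno;  /* assumed per-thread member */
      }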
* prefer new socket syscalls, fallback to SYS_socketcall only if needed (Rich Felker, 2020-08-08; 1 file, -9/+23)

  a number of users performing seccomp filtering have requested use of the new individual syscall numbers for socket syscalls, rather than the legacy multiplexed socketcall, since the latter has the arguments all in memory where they can't participate in filter decisions.

  previously, some archs used the multiplexed socketcall if it was historically all that was available, while other archs used the separate syscalls. the intent was that the latter set only include archs that have "always" had separate socket syscalls, at least going back to linux 2.6.0. however, at least powerpc, powerpc64, and sh were wrongly included in this set, and thus socket operations completely failed on old kernels for these archs.

  with the changes made here, the separate syscalls are always preferred, but fallback code is compiled for archs that also define SYS_socketcall. two such archs, mips (plain o32) and microblaze, define SYS_socketcall despite never having needed it, so it's now undefined by their versions of syscall_arch.h to prevent inclusion of useless fallback code.

  some archs, where the separate syscalls were only added after the addition of SYS_accept4, lack SYS_accept. because socket calls are always made with zeros in the unused argument positions, it suffices to just use SYS_accept4 to provide a definition of SYS_accept, and this is done to satisfy the macro machinery that concatenates the socket call name onto __SC_ and SYS_.
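  the call pattern, sketched with musl-style names (__syscall returning a negated errno, __SC_* multiplexer indices); details are illustrative, not the committed macros:

      #include <errno.h>

      #ifndef SYS_accept
      #define SYS_accept SYS_accept4  /* unused arg positions are always zero */
      #endif

      static int socket_sketch(int domain, int type, int protocol)
      {
          long r = __syscall(SYS_socket, domain, type, protocol);
      #ifdef SYS_socketcall
          if (r == -ENOSYS) {
              long args[3] = { domain, type, protocol };
              r = __syscall(SYS_socketcall, __SC_socket, args);
          }
      #endif
          return __syscall_ret(r);  /* sets errno on negative r */
      }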
* math: add __math_invalidl (Szabolcs Nagy, 2020-08-05; 1 file, -0/+3)

  for targets where long double is different from double.
* fix C implementation of a_clz_32 (Rich Felker, 2020-07-05; 1 file, -1/+1)

  this broke mallocng size_to_class on archs without a native implementation of a_clz_32. the incorrect logic seems to have been something i derived from a related but distinct log2-type operation. with the change made here, it passes an exhaustive test. as this function is new and presently only used by mallocng, no other functionality was affected.
* add fallback a_clz_32 implementation (Rich Felker, 2020-06-11; 1 file, -0/+15)

  some archs already have a_clz_32, used to provide a_ctz_32, but it hasn't been mandatory because it's not used anywhere yet. mallocng will need it, however, so add it now. it should probably be optimized better, but doesn't seem to make a difference at present.
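  one portable way to write such a fallback (a sketch, not necessarily the committed code; assumes x != 0):

      #include <stdint.h>

      static inline int a_clz_32_sketch(uint32_t x)
      {
          int r = 0;
          if (!(x & 0xffff0000)) { r += 16; x <<= 16; }
          if (!(x & 0xff000000)) { r += 8;  x <<= 8; }
          if (!(x & 0xf0000000)) { r += 4;  x <<= 4; }
          if (!(x & 0xc0000000)) { r += 2;  x <<= 2; }
          if (!(x & 0x80000000)) { r += 1; }
          return r;
      }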
* have ldso track replacement of aligned_alloc (Rich Felker, 2020-06-10; 1 file, -0/+1)

  this is in preparation for improving behavior of malloc interposition.
* reintroduce calloc elision of memset for direct-mmapped allocations (Rich Felker, 2020-06-10; 1 file, -0/+1)

  a new weak predicate function replaceable by the malloc implementation, __malloc_allzerop, is introduced. by default it's always false; the default version will be used when static linking if the bump allocator was used (in which case performance doesn't matter) or if malloc was replaced by the application. only if the real internal malloc is linked (always the case with dynamic linking) does the real version get used.

  if malloc was replaced dynamically, as indicated by __malloc_replaced, the predicate function is ignored and the memset is always performed.
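  the resulting calloc logic, roughly (a sketch with the overflow check and internal details abbreviated, not the verbatim source):

      #include <stdlib.h>
      #include <string.h>

      void *calloc_sketch(size_t m, size_t n)
      {
          if (n && m > (size_t)-1/n) return 0;  /* real code sets ENOMEM */
          n *= m;
          void *p = malloc(n);
          if (!p) return p;
          if (!__malloc_replaced && __malloc_allzerop(p))
              return p;          /* e.g. fresh direct mmap: already zero */
          return memset(p, 0, n);
      }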
* move malloc_impl.h from src/internal to src/malloc (Rich Felker, 2020-06-02; 1 file, -43/+0)

  this reflects that it is no longer intended for consumption outside of the malloc implementation.
* move declaration of interfaces between malloc and ldso to dynlink.h (Rich Felker, 2020-06-02; 2 files, -4/+4)

  this eliminates consumers of malloc_impl.h outside of the malloc implementation.
* restore lock-skipping for processes that return to single-threaded state (Rich Felker, 2020-05-22; 1 file, -0/+1)

  the design used here relies on the barrier provided by the first lock operation after the process returns to single-threaded state to synchronize with actions by the last thread that exited. by storing the intent to change modes in the same object used to detect whether locking is needed, it's possible to avoid an extra (possibly costly) memory load after the lock is taken.
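  a sketch of that design (value meanings per the description above; the actual lock code differs in detail):

      /* need_locks: 1 = multi-threaded; -1 = back to single-threaded but
       * the synchronizing barrier hasn't happened yet; 0 = skippable */
      void lock_sketch(volatile int *l)
      {
          int need_locks = libc.need_locks;
          if (!need_locks) return;      /* single-threaded: skip the lock */
          acquire_with_barrier(l);      /* stand-in for the atomic acquire */
          if (need_locks < 0)
              libc.need_locks = 0;      /* this acquire synchronized us */
      }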
* cut down size of some libc struct members (Rich Felker, 2020-05-22; 1 file, -3/+3)

  these are all flags that can be single-byte values.
* don't use libc.threads_minus_1 as relaxed atomic for skipping locks (Rich Felker, 2020-05-22; 1 file, -1/+1)

  after all but the last thread exits, the next thread to observe libc.threads_minus_1==0 and conclude that it can skip locking fails to synchronize with any changes to memory that were made by the last-exiting thread. this can produce data races.

  on some archs, at least x86, memory synchronization is unlikely to be a problem; however, with the inline locks in malloc, skipping the lock also eliminated the compiler barrier, and caused code that needed to re-check chunk in-use bits after obtaining the lock to reuse a stale value, possibly from before the process became single-threaded. this in turn produced corruption of the heap state.

  some uses of libc.threads_minus_1 remain, especially for allocation of new TLS in the dynamic linker; otherwise, it could be removed entirely. it's made non-volatile to reflect that the remaining accesses are only made under lock on the thread list.

  instead of libc.threads_minus_1, libc.threaded is now used for skipping locks. the difference is that libc.threaded is permanently true once an additional thread has been created. this will produce some performance regression in processes that are mostly single-threaded but occasionally creating threads. in the future it may be possible to bring back the full lock-skipping, but more care needs to be taken to produce a safe design.
* move __string_read into vsscanf source file (Rich Felker, 2020-04-17; 1 file, -2/+0)

  apparently this function was intended at some point to be used by the strto* family as well, and thus was put in its own file; however, as far as I can tell, it's only ever been used by vsscanf. move it to the same file to reduce the number of source files and external symbols.
* fix possible access to uninitialized memory in shgetc (via scanf) (Rich Felker, 2020-04-17; 1 file, -1/+1)

  shgetc sets up to be able to perform an "unget" operation without the caller having to remember and pass back the character value, and for this purpose used a conditional store idiom:

      if (f->rpos[-1] != c) f->rpos[-1] = c;

  to make it safe to use with non-writable buffers (set up by the sh_fromstring macro or __string_read with sscanf). however, validity of this depends on the buffer space at rpos[-1] being initialized, which is not the case under some conditions (including at least unbuffered files and fmemopen ones).

  whenever data was read "through the buffer", the desired character value is already in place and does not need to be written. thus, rather than testing for the absence of the value, we can test for rpos<=buf, indicating that the last character read could not have come from the buffer, and thereby that we have a "real" buffer (possibly of zero length) with writable pushback (UNGET bytes) below it.
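  the resulting check, sketched with the internal FILE fields named above:

      /* store only when c cannot have come from the buffer itself,
       * i.e. there is a real buffer with writable UNGET space below it */
      if (f->rpos <= f->buf) f->rpos[-1] = c;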
* math: fix sinh overflows in non-nearest rounding (Szabolcs Nagy, 2020-02-21; 1 file, -2/+2)

  The final rounding operation should be done with the correct sign, otherwise huge results may incorrectly get rounded to or away from infinity in upward or downward rounding modes. This affected sinh and sinhf, which set the sign on the result after a potentially overflowing mul.

  There may be other non-nearest rounding issues, but this was a known long-standing issue with large ulp error (depending on how ulp is defined near infinity). The fix should have no effect on sinh and sinhf performance but may have a tiny effect on cosh and coshf.
* remove legacy time32 timer[fd] syscalls from public syscall.h (Rich Felker, 2020-02-05; 1 file, -0/+16)

  this extends commit 5a105f19b5aae79dd302899e634b6b18b3dcd0d6, removing timer[fd]_settime and timer[fd]_gettime. the timerfd ones are likely to have been used in software that started using them before it could rely on libc exposing functions.
* remove further legacy time32 clock syscalls from public syscall.h (Rich Felker, 2020-02-05; 1 file, -0/+16)

  this extends commit 5a105f19b5aae79dd302899e634b6b18b3dcd0d6, removing clock_settime, clock_getres, clock_nanosleep, and settimeofday.
* remove legacy clock_gettime and gettimeofday from public syscall.h (Rich Felker, 2020-01-30; 1 file, -0/+7)

  some nontrivial number of applications have historically performed direct syscalls for these operations rather than using the public functions. such usage is invalid now that time_t is 64-bit and these syscalls no longer match the types they are used with, and it was already harmful before (by suppressing use of vdso).

  since syscall() has no type safety, incorrect usage of these syscalls can't be caught at compile-time. so, without manually inspecting or running additional tools to check sources, the risk of such errors slipping through is high. this patch renames the syscalls on 32-bit archs to clock_gettime32 and gettimeofday_time32, so that applications using the original names will fail to build without being fixed.

  note that there are a number of other syscalls that may also be unsafe to use directly after the time64 switchover, but (1) these are the main two that seem to be in widespread use, and (2) most of the others continue to have valid usage with a null timeval/timespec argument, as the argument is an optional timeout or similar.
* move stage3_func typedef out of shared internal dynlink.h header (Rich Felker, 2019-12-31; 1 file, -1/+0)

  this interface contract is entirely internal to dynlink.c.
* implement SO_TIMESTAMP[NS] fallback for kernels without time64 versions (Rich Felker, 2019-12-17; 1 file, -0/+7)

  the definitions of SO_TIMESTAMP* changed on 32-bit archs in commit 38143339646a4ccce8afe298c34467767c899f51 to the new versions that provide 64-bit versions of timeval/timespec structure in control message payload.

  socket options, being state attached to the socket rather than function calls, are not trivial to implement as fallbacks on ENOSYS, and support for them was initially omitted on the assumption that the ioctl-based polling alternatives (SIOCGSTAMP*) could be used instead by applications if setsockopt fails. unfortunately, it turns out that SO_TIMESTAMP is sufficiently old and widely supported that a number of applications assume it's available and treat errors as fatal.

  this patch introduces emulation of SO_TIMESTAMP[NS] on pre-time64 kernels by falling back to setting the "_OLD" (time32) versions of the options if the time64 ones are not recognized, and performing translation of the SCM_TIMESTAMP[NS] control messages in recvmsg. since recvmsg does not know whether its caller is legacy time32 code or time64, it performs translation for any SCM_TIMESTAMP[NS]_OLD control messages it sees, leaving the original time32 timestamp as-is (it can't be rewritten in-place anyway, and memmove would be mildly expensive) and appending the converted time64 control message at the end of the buffer. legacy time32 callers will see the converted one as a spurious control message of unknown type; time64 callers running on pre-time64 kernels will see the original one as a spurious control message of unknown type. a time64 caller running on a kernel with native time64 support will only see the time64 version of the control message.

  emulation of SO_TIMESTAMPING is not included at this time since (1) applications which use it seem to be prepared for the possibility that it's not present or working, and (2) it can also be used in sendmsg control messages, in a manner that looks complex to emulate completely, and costly even when running on a time64-supporting kernel. corresponding changes in recvmmsg are not made at this time; they will be done separately.
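  the setsockopt side of the emulation, sketched (raw_setsockopt is a stand-in for the underlying syscall; the _OLD macro names mirror the kernel's; the recvmsg control-message translation is the other half):

      #include <errno.h>
      #include <sys/socket.h>

      static int setsockopt_ts_sketch(int fd, int opt, const void *v, socklen_t l)
      {
          long r = raw_setsockopt(fd, SOL_SOCKET, opt, v, l);
          if (r == -ENOPROTOOPT && (opt == SO_TIMESTAMP || opt == SO_TIMESTAMPNS)) {
              /* pre-time64 kernel: retry with the _OLD (time32) option */
              int old = opt == SO_TIMESTAMP ? SO_TIMESTAMP_OLD : SO_TIMESTAMPNS_OLD;
              r = raw_setsockopt(fd, SOL_SOCKET, old, v, l);
          }
          return r < 0 ? (errno = -r, -1) : (int)r;
      }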
* fix incorrect use of fabs on long double operand in floatscan.c (Rich Felker, 2019-10-18; 1 file, -4/+1)

  based on patch by Dan Gohman, who caught this via compiler warnings. analysis by Szabolcs Nagy determined that it's a bug, whereby errno can be set incorrectly for values where the coercion from long double to double causes rounding. it seems likely that floating point status flags may be set incorrectly as a result too.

  at the same time, clean up use of preprocessor concatenation involving LDBL_MANT_DIG, which spuriously depends on it being a single unadorned decimal integer literal, and instead use the equivalent formulation 2/LDBL_EPSILON. an equivalent change on the printf side was made in commit bff6095d915f3e41206e47ea2a570ecb937ef926.
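  the equivalence being relied on: LDBL_EPSILON is 2^(1-LDBL_MANT_DIG), so 2/LDBL_EPSILON equals 2^LDBL_MANT_DIG without any token pasting:

      #include <float.h>

      /* 2 / 2^(1-LDBL_MANT_DIG) == 2^LDBL_MANT_DIG (illustrative name) */
      static const long double two_pow_mant_dig = 2/LDBL_EPSILON;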
* remove remaining traces of __tls_get_new (Szabolcs Nagy, 2019-09-29; 1 file, -1/+0)

  Some declarations of __tls_get_new were left in the code, even though the definition got removed in commit 9d44b6460ab603487dab4d916342d9ba4467e6b9 ("install dynamic tls synchronously at dlopen, streamline access").

  this can make the build fail with

      ld: lib/libc.so: hidden symbol `__tls_get_new' isn't defined

  when libc.so is linked without --gc-sections, because a .hidden declaration in asm code creates a reference even if the symbol is not actually used.
* add support for powerpc/powerpc64 unaligned relocations (Samuel Holland, 2019-08-11; 1 file, -0/+1)

  R_PPC_UADDR32 (R_PPC64_UADDR64) has the same meaning as R_PPC_ADDR32 (R_PPC64_ADDR64), except that its address need not be aligned. For powerpc64, BFD ld(1) will automatically convert between ADDR<->UADDR relocations when the address is/isn't at its native alignment. This will happen if, for example, there is a pointer in a packed struct.

  gold and lld do not currently generate R_PPC64_UADDR64, but pass through misaligned R_PPC64_ADDR64 relocations from object files, possibly relaxing them to misaligned R_PPC64_RELATIVE. In both cases (relaxed or not) this violates the PSABI, which defines the relevant field type as "a 64-bit field occupying 8 bytes, the alignment of which is 8 bytes unless otherwise specified." All three linkers violate the PSABI on 32-bit powerpc, where the only difference is that the field is 32 bits wide, aligned to 4 bytes.

  Currently musl fails to load executables linked by BFD ld containing R_PPC64_UADDR64, with the error "unsupported relocation type 43". This change provides compatibility with BFD ld on powerpc64, and any static linker on either architecture that starts following the PSABI more closely.
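  applying such a relocation only requires a byte copy instead of a direct (alignment-assuming) pointer store; a sketch with illustrative names:

      #include <stdint.h>
      #include <string.h>

      static void write_uaddr64_sketch(unsigned char *reloc_addr, uint64_t val)
      {
          memcpy(reloc_addr, &val, sizeof val);  /* valid at any alignment */
      }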
* ioctl: add fallback for new time64 SIOCGSTAMP[NS] (Rich Felker, 2019-07-31; 1 file, -0/+7)

  without this, the SIOCGSTAMP and SIOCGSTAMPNS ioctl commands, for obtaining timestamps, would stop working on pre-5.1 kernels after time_t is switched to 64-bit and their values are changed to the new time64 versions. new code is written such that it's statically unreachable on 64-bit archs, and on existing 32-bit archs until the macro values are changed to activate 64-bit time_t.
* get/setsockopt: add fallback for new time64 SO_RCVTIMEO/SO_SNDTIMEO (Rich Felker, 2019-07-31; 1 file, -0/+7)

  without this, the SO_RCVTIMEO and SO_SNDTIMEO socket options would stop working on pre-5.1 kernels after time_t is switched to 64-bit and their values are changed to the new time64 versions. new code is written such that it's statically unreachable on 64-bit archs, and on existing 32-bit archs until the macro values are changed to activate 64-bit time_t.
* make __socketcall analogous to __syscall, error-returning (Rich Felker, 2019-07-31; 1 file, -6/+6)

  the __socketcall and __socketcall_cp macros are remnants from a really old version of the syscall-mechanism infrastructure, and don't follow the pattern that the "__" version of the macro returns the raw negated error number rather than setting errno and returning -1. for time64 purposes, some socket syscalls will need to operate on the error value rather than returning immediately, so fix this up so they can use it.
* internally, define plain syscalls, if missing, as their time64 variants (Rich Felker, 2019-07-27; 1 file, -0/+83)

  this commit has no effect whatsoever right now, but is in preparation for a future riscv32 port and other future 32-bit archs that will be "time64-only" from the start on the kernel side.

  together with the previous x32 changes, this commit ensures that syscall call points that don't care about time (passing null timeouts, etc.) can continue to do so without having to special-case time64-only archs, and allows code using the time64 syscalls to uniformly test for the need to fallback with SYS_foo != SYS_foo_time64, rather than needing to check defined(SYS_foo) && SYS_foo != SYS_foo_time64.
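  the aliasing and the uniform fallback test it enables, sketched for one syscall on a 32-bit arch (conversion details abbreviated; not the committed code):

      #include <errno.h>
      #include <time.h>

      #ifndef SYS_clock_gettime
      #define SYS_clock_gettime SYS_clock_gettime64  /* time64-only arch */
      #endif

      static int clock_gettime_sketch(clockid_t clk, struct timespec *ts)
      {
          long r = __syscall(SYS_clock_gettime64, clk, ts);
          if (r == -ENOSYS && SYS_clock_gettime != SYS_clock_gettime64) {
              long ts32[2];  /* legacy time32 timespec layout */
              r = __syscall(SYS_clock_gettime, clk, ts32);
              if (!r) { ts->tv_sec = ts32[0]; ts->tv_nsec = ts32[1]; }
          }
          return __syscall_ret(r);
      }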
* do not use _Noreturn for a function pointer in dynamic linker (Matthew Maurer, 2019-06-21; 1 file, -1/+1)

  _Noreturn is a C11 construct, and may only be used at the site of a function definition.
* allow archs to provide a 7-argument syscall if needed (Rich Felker, 2019-05-05; 1 file, -0/+1)

  commit 788d5e24ca19c6291cebd8d1ad5b5ed6abf42665 noted that we could add this if needed, and in fact it is needed, but not for one of the archs documented as having a 7th syscall arg register. rather, it's needed for mips (o32), where all but the first 4 arguments are passed on the stack, and the stack can accommodate a 7th.
* make new math code compatible with unused variable warning/error (Rich Felker, 2019-04-20; 1 file, -3/+6)

  commit b50d315fd23f0fbc4c11e2583801dd123d933745 introduced fp_force_eval, implemented by default with a dead store to a volatile variable. unfortunately this introduces warnings with -Wunused-variable and breaks the ability to use -Werror with the default warning options set by configure when warnings are enabled.

  we could just call fp_barrier instead, but that results in a spurious load after the store due to volatile semantics. the fix committed here avoids the load. it will still produce warnings without -Wno-unused-but-set-variable, but that's part of our default warning profile, and there are already other locations in the source where an unused variable warning will occur without it.
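  the default described above, as a sketch:

      /* set-but-unused volatile: forces the store (and thus evaluation)
       * without the read-back a fp_barrier-style helper would add */
      static inline void fp_force_eval_sketch(double x)
      {
          volatile double y;
          y = x;
      }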
* math: new pow (Szabolcs Nagy, 2019-04-17; 1 file, -0/+1)

  from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc

  The underflow exception is signaled if the result is in the subnormal range even if the result is exact.

  code size change: +3421 bytes.

  benchmark on x86_64 before, after, speedup:

  -Os:
    pow rthruput:  102.96 ns/call   33.38 ns/call  3.08x
    pow latency:   144.37 ns/call   54.75 ns/call  2.64x
  -O3:
    pow rthruput:   98.91 ns/call   32.79 ns/call  3.02x
    pow latency:   138.74 ns/call   53.78 ns/call  2.58x
* math: new powf (Szabolcs Nagy, 2019-04-17; 1 file, -0/+6)

  from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc

  POWF_SCALE != 1.0 case only matters if TOINT_INTRINSICS is set, which is currently not supported for any target. SNaN is not supported, it would require an issignalingf implementation.

  code size change: -816 bytes.

  benchmark on x86_64 before, after, speedup:

  -Os:
    powf rthruput:  95.14 ns/call   20.04 ns/call  4.75x
    powf latency:  137.00 ns/call   34.98 ns/call  3.92x
  -O3:
    powf rthruput:  92.48 ns/call   13.67 ns/call  6.77x
    powf latency:  131.11 ns/call   35.15 ns/call  3.73x
* math: new exp2f and expf (Szabolcs Nagy, 2019-04-17; 1 file, -0/+16)

  from https://github.com/ARM-software/optimized-routines, commit 04884bd04eac4b251da4026900010ea7d8850edc

  In expf TOINT_INTRINSICS is kept, but is unused; it would require support for __builtin_round and __builtin_lround as single instructions.

  code size change: +94 bytes.

  benchmark on x86_64 before, after, speedup:

  -Os:
    expf rthruput:    9.19 ns/call   8.11 ns/call  1.13x
    expf latency:    34.19 ns/call  18.77 ns/call  1.82x
    exp2f rthruput:   5.59 ns/call   6.52 ns/call  0.86x
    exp2f latency:   17.93 ns/call  16.70 ns/call  1.07x
  -O3:
    expf rthruput:    9.12 ns/call   4.92 ns/call  1.85x
    expf latency:    34.44 ns/call  18.99 ns/call  1.81x
    exp2f rthruput:   5.58 ns/call   4.49 ns/call  1.24x
    exp2f latency:   17.95 ns/call  16.94 ns/call  1.06x
* math: add configuration macros (Szabolcs Nagy, 2019-04-17; 1 file, -0/+5)

  Musl currently aims to support non-nearest rounding mode and does not support SNaNs. These macros allow marking relevant code paths in case these decisions are changed later (they also help documenting the corner cases involved).
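  a sketch of such macros, following the ARM optimized-routines naming convention (an assumption; values reflect the policy stated above):

      #define WANT_ROUNDING 1  /* keep results reasonable in non-nearest modes */
      #define WANT_SNAN 0      /* signaling NaN special-casing not supported */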
* math: add macros for static branch prediction hints (Szabolcs Nagy, 2019-04-17; 1 file, -0/+9)

  These don't have an effect with -Os, so they are not useful with default settings other than for documenting the expectation. With --enable-optimize=internal,malloc,string,math the libc.so code size increases by 18K on x86_64 and performance varies in -2% .. +10%.
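  such hints are conventionally built on __builtin_expect (a sketch; assumes GNU C):

      #ifdef __GNUC__
      #define predict_true(x)  __builtin_expect(!!(x), 1)
      #define predict_false(x) __builtin_expect(x, 0)
      #else
      #define predict_true(x)  (x)
      #define predict_false(x) (x)
      #endif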
* math: add double precision error handling functions (Szabolcs Nagy, 2019-04-17; 1 file, -0/+5)
* math: add single precision error handling functions (Szabolcs Nagy, 2019-04-17; 1 file, -0/+7)

  These are supposed to be used in tail call positions when handling special cases in new code. (fp exceptions may be raised "naturally" by the common code path if special casing is more effort.)

  This implements the error handling apis used in https://github.com/ARM-software/optimized-routines without errno setting.
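  sketches of two such tail-call handlers (illustrative; the committed versions may differ in detail):

      #include <stdint.h>

      static float math_invalidf_sketch(float x)
      {
          return (x - x) / (x - x);  /* raises invalid, yields a NaN */
      }

      static float math_oflowf_sketch(uint32_t sign)
      {
          /* 2^97 * 2^97 = 2^194 exceeds float range: raises overflow
           * and inexact, returns an infinity of the requested sign */
          return (sign ? -0x1p97f : 0x1p97f) * 0x1p97f;
      }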
* math: add eval_as_float and eval_as_double (Szabolcs Nagy, 2019-04-17; 1 file, -0/+17)

  Previously type casts or assignments were used for handling excess precision, which assumed standard C99 semantics, but since it's a rarely needed obscure detail, it's better to use explicit helper functions to document where we rely on this. It also helps if the code is used outside of the libc in non-C99 compilation mode: with the default excess precision handling of gcc, explicit inline asm barriers are needed for narrowing on FLT_EVAL_METHOD!=0 targets.

  I plan to use this in new code with the existing style that uses double_t and float_t as much as possible. One ugliness is that it is required for almost every return statement since that does not drop excess precision (the standard changed this in C11 annex F, but that does not help in non-standard compilation modes or with old compilers).
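  the helper shape, as a sketch (as the message notes, under non-standard excess-precision handling the assignment alone may not narrow, and an asm barrier is needed instead):

      static inline float eval_as_float_sketch(float x)
      {
          float y = x;  /* drops excess precision under C99/C11 annex F */
          return y;
      }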
* math: add fp_arch.h with fp_barrier and fp_force_eval (Szabolcs Nagy, 2019-04-17; 1 file, -6/+65)

  C99 has ways to support fenv access, but compilers don't implement it and assume nearest rounding mode and no fp status flag access. (gcc has -frounding-math and then it does not assume nearest rounding mode, but it still assumes the compiled code itself does not change the mode. Even if the C99 mechanism was implemented it is not ideal: it requires all code in the library to be compiled with FENV_ACCESS "on" to make it usable in non-nearest rounding mode, but that limits optimizations more than necessary.)

  The math functions should give reasonable results in all rounding modes (but the quality may be degraded in non-nearest rounding modes) and the fp status flag settings should follow the spec, so fenv side-effects are important and code transformations that break them should be prevented.

  Unfortunately compilers don't give any help with this; the best we can do is to add fp barriers to the code using volatile local variables (they create a stack frame and undesirable memory accesses to it) or inline asm (gcc specific, requires target specific fp reg constraints, often creates unnecessary reg moves and multiple barriers are needed to express that an operation has side-effects) or an extern call (only useful in tail-call position to avoid stack-frame creation and does not work with lto).

  We assume that in a math function, if an operation depends on the input and the output depends on it, then the operation will be evaluated at runtime when the function is called, producing all the expected fenv side-effects (this is not true in case of lto and in case the operation is evaluated with excess precision that is not rounded away). So fp barriers are needed (1) to prevent the move of an operation within a function (in case it may be moved from an unevaluated code path into an evaluated one or if it may be moved across a fenv access), (2) to force the evaluation of an operation for its side-effect when it has no input dependency (may be constant folded), or (3) when its output is unused. I believe that fp_barrier and fp_force_eval can take care of these and they should not be needed in hot code paths.
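  the volatile-variable form of such a barrier, as a sketch:

      static inline double fp_barrier_sketch(double x)
      {
          volatile double y = x;  /* compiler must materialize x here */
          return y;               /* and must reload y: an opaque dependency */
      }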
* math: remove sun copyright from libm.h (Szabolcs Nagy, 2019-04-17; 1 file, -23/+0)

  Nothing is left from the original fdlibm header nor from the bsd modifications to it other than some internal api declarations. Comments are dropped that may be copyrightable content.