path: root/src/thread
Commit message (Author, Date, Files, Lines)
* eliminate use of SHARED macro in __tls_get_addr (Rich Felker, 2015-11-11, 1 file, -6/+6)
  this was only a tiny optimization, and static-linked binaries should not be calling __tls_get_addr anyway since the linker is supposed to perform relaxation, resulting in use of the local-exec TLS model.
* eliminate use of SHARED macro to suppress visibility attributes (Rich Felker, 2015-11-11, 2 files, -6/+0)
  this is the first and simplest stage of removal of the SHARED macro, which will eventually allow libc.a and libc.so to be produced from the same object files.
  the original motivation for these #ifdefs which are now being removed was to allow building a static-only libc using a compiler that does not support visibility. however, SHARED was the wrong condition to test for this anyway; various assembly-language sources refer to hidden symbols and declare them with the .hidden directive, making it wrong to define the referenced symbols as non-hidden. if there is a need in the future to build libc using compilers that lack visibility, support could be moved to the build system or perhaps the __PIC__ macro could be checked instead of SHARED.
* explicitly assemble all arm asm sources as UAL (Rich Felker, 2015-11-10, 3 files, -0/+3)
  these files are all accepted as legacy arm syntax when producing arm code, but legacy syntax cannot be used for producing thumb2 with access to the full ISA. even after switching to UAL, some asm source files contain instructions which are not valid in thumb mode, so these will need to be addressed separately.
* remove non-working pre-armv4t support from arm asm (Rich Felker, 2015-11-09, 2 files, -4/+0)
  the idea of the three-instruction sequence being removed was to be able to return to thumb code when used on armv4t+ from a thumb caller, but also to be able to run on armv4 without the bx instruction available (in which case the low bit of lr would always be 0). however, without compiler support for generating such a sequence from C code, which does not exist and which there is unlikely to be interest in implementing, there is little point in having it in the asm, and it would likely be easier to add pre-armv4t support via enhanced linker handling of R_ARM_V4BX than at the compiler level.
  removing this code simplifies adding support for building libc in thumb2-only form (for cortex-m).
* use explicit __cp_cancel label in cancellable syscall asm for all archs (Rich Felker, 2015-11-02, 8 files, -28/+32)
  previously, only archs that needed to do stack cleanup defined a __cp_cancel label for acting on cancellation in their syscall asm, and a default definition was provided by a weak alias to __cancel, the C function. this resulted in wrong codegen for arm on gcc versions affected by pr 68178 and possibly similar issues (like pr 66609) on other archs, and also created an inconsistency where the __cp_begin and __cp_end labels were treated as const data but __cp_cancel was treated as a function. this in turn caused incorrect code generation on archs where function pointers point to function descriptors rather than code (for now, only sh/fdpic).
* properly access mcontext_t program counter in cancellation handler (Rich Felker, 2015-11-02, 1 file, -3/+4)
  using the actual mcontext_t definition rather than an overlaid pointer array both improves correctness/readability and eliminates some ugly hacks for archs with 64-bit registers but 32-bit program counter.
  also fix UB due to comparison of pointers not in a common array object.
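A minimal sketch of the pointer-comparison point above, assuming the handler needs to know whether a program counter lies between two code labels. The helper name `toy_pc_in_range` is hypothetical and is not taken from the musl sources. Converting to `uintptr_t` before comparing avoids applying `<` to pointers that do not point into a common array object.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not musl's code): reports whether pc lies in the
 * half-open range [begin, end). The comparison is done on integers, so
 * it does not depend on the three pointers sharing an array object. */
int toy_pc_in_range(const void *pc, const void *begin, const void *end)
{
    uintptr_t p = (uintptr_t)pc;
    uintptr_t b = (uintptr_t)begin;
    uintptr_t e = (uintptr_t)end;

    assert(b <= e); /* caller must pass a well-formed range */
    return p >= b && p < e;
}
```

The integer values produced by the conversion are implementation-defined, so this sketch assumes a flat address space in which they order the same way as addresses.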
* add missing memory barrier to pthread_join (Bobby Bingham, 2015-10-15, 1 file, -0/+1)
  POSIX requires pthread_join to synchronize memory on success. The futex wait inside __timedwait_cp cannot handle this because it's not called in all cases. Also, in the case of a spurious wake, tid can become zero between the wake and when the joining thread checks it.
* make sh clone asm fdpic-compatible (Rich Felker, 2015-09-12, 1 file, -3/+9)
  clone calls back to a function pointer provided by the caller, which will actually be a pointer to a function descriptor on fdpic. the obvious solution is to have a separate version of clone for fdpic, but I have taken a simpler approach to go around the problem. instead of calling the pointed-to function from asm, a direct call is made to an internal C function which then calls the pointed-to function. this lets the C compiler generate the appropriate calling convention for an indirect call with no need for ABI-specific assembly.
* fix local-dynamic model TLS on mips and powerpc (Rich Felker, 2015-06-25, 1 file, -2/+2)
  the TLS ABI spec for mips, powerpc, and some other (presently unsupported) RISC archs has the return value of __tls_get_addr offset by +0x8000 and the result of DTPOFF relocations offset by -0x8000. I had previously assumed this part of the ABI was actually just an implementation detail, since the adjustments cancel out. however, when the local dynamic model is used for accessing TLS that's known to be in the same DSO, either of the following may happen:
  1. the -0x8000 offset may already be applied to the argument structure passed to __tls_get_addr at ld time, without any opportunity for runtime relocations.
  2. __tls_get_addr may be used with a zero offset argument to obtain a base address for the module's TLS, to which the caller then applies immediate offsets for individual objects accessed using the local dynamic model. since the immediate offsets have the -0x8000 adjustment applied to them, the base address they use needs to include the +0x8000 offset.
  it would be possible, but more complex, to store the pointers in the dtv[] array with the +0x8000 offset pre-applied, to avoid the runtime cost of adding 0x8000 on each call to __tls_get_addr. this change could be made later if measurements show that it would help.
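The offset arithmetic described above can be sketched with a toy model. Everything here is an assumption for illustration: the struct layout, the `toy_` names, and the use of integer addresses instead of real TLS blocks are not taken from musl or from the ABI documents. The sketch only shows why the +0x8000 applied by the lookup has to be real, rather than folded away against the -0x8000 in the offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DTP_BIAS 0x8000 /* the bias named in the commit message */

/* Toy stand-in for the argument structure passed to __tls_get_addr. */
struct toy_tls_index {
    size_t module;   /* index into dtv[] */
    intptr_t offset; /* DTPOFF-style value, already biased by -0x8000 */
};

/* What a DTPOFF-style relocation would yield for an object located
 * real_offset bytes into its module's TLS block. */
intptr_t toy_dtpoff(size_t real_offset)
{
    return (intptr_t)real_offset - TOY_DTP_BIAS;
}

/* Toy lookup: unbiased module base + biased offset + 0x8000.
 * Addresses are modeled as uintptr_t so the arithmetic is well defined. */
uintptr_t toy_tls_get_addr(const struct toy_tls_index *ti, const uintptr_t *dtv)
{
    assert(ti != NULL && dtv != NULL);
    return dtv[ti->module] + (uintptr_t)ti->offset + TOY_DTP_BIAS;
}
```

In case 2 of the list, a caller passes a zero offset and receives the module base plus 0x8000. Adding a biased immediate offset to that value then lands on the object, which would not happen if the lookup returned the unbiased base.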
* work around mips detached thread exit breakage due to kernel regression (Rich Felker, 2015-06-20, 1 file, -0/+1)
  linux kernel commit 46e12c07b3b9603c60fc1d421ff18618241cb081 caused the mips syscall mechanism to fail with EFAULT when the userspace stack pointer is invalid, breaking __unmapself used for detached thread exit. the workaround is to set $sp to a known-valid, readable address, and the simplest one to obtain is the address of the current function, which is available (per o32 calling convention) in $25.
* ignore ENOSYS error from mprotect in pthread_create and dynamic linker (Rich Felker, 2015-06-17, 1 file, -1/+2)
  this error simply indicated a system without memory protection (NOMMU) and should not cause failure in the caller.
* switch to using trap number 31 for syscalls on sh (Rich Felker, 2015-06-16, 3 files, -5/+5)
  nominally the low bits of the trap number on sh are the number of syscall arguments, but they have never been used by the kernel, and some code making syscalls does not even know the number of arguments and needs to pass an arbitrary high number anyway.
  sh3/sh4 traditionally used the trap range 16-31 for syscalls, but part of this range overlapped with hardware exceptions/interrupts on sh2 hardware, so an incompatible range 32-47 was chosen for sh2.
  using trap number 31 everywhere, since it's in the existing sh3/sh4 range and does not conflict with sh2 hardware, is a proposed unification of the kernel syscall convention that will allow binaries to be shared between sh2 and sh3/sh4. if this is not accepted into the kernel, we can refit the sh2 target with runtime selection mechanisms for the trap number, but doing so would be invasive and would entail non-trivial overhead.
* switch sh port's __unmapself to generic version when running on sh2/nommu (Rich Felker, 2015-06-16, 1 file, -3/+3)
  due to the way the interrupt and syscall trap mechanism works, userspace on sh2 must never set the stack pointer to an invalid value. thus, the approach used on most archs, where __unmapself executes with no stack for the interval between SYS_munmap and SYS_exit, is not viable on sh2.
  in order not to pessimize sh3/sh4, the sh asm version of __unmapself is not removed. instead it's renamed and redirected through code that calls either the generic (safe) __unmapself or the sh3/sh4 asm, depending on compile-time and run-time conditions.
* add support for sh2 interrupt-masking-based atomics to sh port (Rich Felker, 2015-06-16, 1 file, -6/+0)
  the sh2 target is being considered an ISA subset of sh3/sh4, in the sense that binaries built for sh2 are intended to be usable on later cpu models/kernels with mmu support. so rather than hard-coding sh2-specific atomics, the runtime atomic selection mechanism that was already in place has been extended to add sh2 atomics.
  at this time, the sh2 atomics are not SMP-compatible; since the ISA lacks actual atomic operations, the new code instead masks interrupts for the duration of the atomic operation, producing an atomic result on single-core. this is only possible because the kernel/hardware does not impose protections against userspace doing so. additional changes will be needed to support future SMP systems.
  care has been taken to avoid producing significant additional code size in the case where it's known at compile-time that the target is not sh2 and does not need sh2-specific code.
* refactor stdio open file list handling, move it out of global libc struct (Rich Felker, 2015-06-16, 1 file, -1/+2)
  functions which open in-memory FILE stream variants all shared a tail with __fdopen, adding the FILE structure to stdio's open file list. replacing this common tail with a function call reduces code size and duplication of logic. the list is also partially encapsulated now.
  function signatures were chosen to facilitate tail call optimization and reduce the need for additional accessor functions.
  with these changes, static linked programs that do not use stdio no longer have an open file list at all.
* implement arch-generic version of __unmapself (Rich Felker, 2015-06-10, 1 file, -0/+29)
  this can be used to put off writing an asm version of __unmapself for new archs, or as a permanent solution on archs where it's not practical or even possible to run momentarily with no stack.
  the concept here is simple: the caller takes a lock on a global shared stack and uses it to make the munmap and exit syscalls. the only trick is unlocking, which must be done after the thread exits, and this is achieved by using the set_tid_address syscall to have the kernel zero and futex-wake the lock word as part of the exit syscall.
* mark mips cancellable syscall code as code (Rich Felker, 2015-05-25, 1 file, -0/+3)
  otherwise disassemblers treat it as data.
* eliminate costly tricks to avoid TLS access for current locale state (Rich Felker, 2015-05-16, 1 file, -6/+0)
  the code being removed used atomics to track whether any threads might be using a locale other than the current global locale, and whether any threads might have abstract 8-bit (non-UTF-8) LC_CTYPE active, a feature which was never committed (still pending). the motivations were to support early execution prior to setup of the thread pointer, to partially support systems (ancient kernels) where thread pointer setup is not possible, and to avoid high performance cost on archs where accessing the thread pointer may be very slow.
  since commit 19a1fe670acb3ab9ead0fe31859ca7d4fe40dd54, the thread pointer is always available, so these hacks are no longer needed. removing them greatly simplifies the affected code.
* in i386 __set_thread_area, don't assume %gs register is initially zero (Rich Felker, 2015-05-16, 1 file, -4/+9)
  commit f630df09b1fd954eda16e2f779da0b5ecc9d80d3 added logic to handle the case where __set_thread_area is called more than once by reusing the GDT slot already in the %gs register, and only setting up a new GDT slot when %gs is zero. this created a hidden assumption that %gs is zero when a new process image starts, which is true in practice on Linux, but does not seem to be documented ABI, and fails to hold under qemu app-level emulation.
  while it would in theory be possible to zero %gs in the entry point code, this code is shared between static and dynamic binaries, and dynamic binaries must not clobber the value of %gs already setup by the dynamic linker.
  the alternative solution implemented in this commit simply uses global data to store the GDT index that's selected. __set_thread_area should only be called in the initial thread anyway (subsequent threads get their thread pointer setup by __clone), but even if it were called by another thread, it would simply read and write back the same GDT index that was already assigned to the initial thread, and thus (in the x86 memory model) there is no data race.
* fix stack protector crashes on x32 & powerpc due to misplaced TLS canary (Rich Felker, 2015-05-06, 1 file, -1/+1)
  i386, x86_64, x32, and powerpc all use TLS for stack protector canary values in the default stack protector ABI, but the location only matched the ABI on i386 and x86_64. on x32, the expected location for the canary contained the tid, thus producing spurious mismatches (resulting in process termination) upon fork. on powerpc, the expected location contained the stdio_locks list head, so returning from a function after calling flockfile produced spurious mismatches. in both cases, the random canary was not present, and a predictable value was used instead, making the stack protector hardening much less effective than it should be.
  in the current fix, the thread structure has been expanded to have canary fields at all three possible locations, and archs that use a non-default location must define a macro in pthread_arch.h to choose which location is used. for most archs (which lack TLS canary ABI) the choice does not matter.
* fix x32 __set_thread_area failure due to junk in upper bits (Rich Felker, 2015-05-02, 1 file, -1/+1)
  the kernel does not properly clear the upper bits of the syscall argument, so we have to do it before the syscall.
* minor optimization to pthread_spin_trylock (Rich Felker, 2015-04-22, 2 files, -2/+4)
  use CAS instead of swap since it's lighter for most archs, and keep EBUSY in the lock value so that the old value obtained by CAS can be used directly as the return value for pthread_spin_trylock.
* optimize spin lock not to dirty cache line while spinning (Rich Felker, 2015-04-22, 1 file, -1/+1)
* fix mmap leak in sem_open failure path for link call (Rich Felker, 2015-04-21, 1 file, -0/+1)
  the leak was found by static analysis (reported by Alexander Monakov), not tested/observed, but seems to have occurred both when failing due to O_EXCL, and in a race condition with O_CREAT but not O_EXCL where a semaphore by the same name was created concurrently.
* make dlerror state and message thread-local and dynamically-allocated (Rich Felker, 2015-04-18, 1 file, -0/+2)
  this fixes truncation of error messages containing long pathnames or symbol names.
  the dlerror state was previously required by POSIX to be global. the resolution of bug 97 relaxed the requirements to allow thread-safe implementations of dlerror with thread-local state and message buffer.
* fix sh build regressions in asm (Rich Felker, 2015-04-17, 1 file, -1/+1)
  even hidden functions need @PLT symbol references; otherwise an absolute address is produced instead of a PC-relative one.
* fix sh __set_thread_area uninitialized return value (Rich Felker, 2015-04-17, 1 file, -1/+2)
  this caused the dynamic linker/startup code to abort when r0 happened to contain a negative value.
* use hidden __tls_get_new for tls/tlsdesc lookup fallback cases (Rich Felker, 2015-04-14, 1 file, -1/+3)
  previously, the dynamic tlsdesc lookup functions and the i386 special-ABI ___tls_get_addr (3 underscores) function called __tls_get_addr when the slot they wanted was not already setup; __tls_get_addr would then in turn also see that it's not setup and call __tls_get_new.
  calling __tls_get_new directly is both more efficient and avoids the issue of calling a non-hidden (public API/ABI) function from asm.
  for the special i386 function, a weak reference to __tls_get_new is used since this function is not defined when static linking (the code path that needs it is unreachable in static-linked programs).
* cleanup use of visibility attributes in pthread_cancel.c (Rich Felker, 2015-04-14, 1 file, -8/+9)
  applying the attribute to a weak_alias macro was a hack. instead use a separate declaration to apply the visibility, and consolidate declarations together to avoid having visibility mess all over the file.
* fix inconsistent visibility for internal syscall symbols (Rich Felker, 2015-04-14, 1 file, -0/+5)
* consistently use hidden visibility for cancellable syscall internals (Rich Felker, 2015-04-14, 11 files, -30/+96)
  in a few places, non-hidden symbols were referenced from asm in ways that assumed ld-time binding. while there is no semantic reason these symbols need to be hidden, fixing the references without making them hidden was going to be ugly, and hidden reduces some bloat anyway.
  in the asm files, .global/.hidden directives have been moved to the top to unclutter the actual code.
* fix inconsistent visibility for internal __tls_get_new function (Rich Felker, 2015-04-14, 1 file, -3/+2)
  at the point of call it was declared hidden, but the definition was not hidden. for some toolchains this inconsistency produced textrels without ld-time binding.
* remove remnants of support for running in no-thread-pointer mode (Rich Felker, 2015-04-13, 4 files, -11/+5)
  since 1.1.0, musl has nominally required a thread pointer to be setup. most of the remaining code that was checking for its availability was doing so for the sake of being usable by the dynamic linker. as of commit 71f099cb7db821c51d8f39dfac622c61e54d794c, this is no longer necessary; the thread pointer is now valid before any libc code (outside of dynamic linker bootstrap functions) runs.
  this commit essentially concludes "phase 3" of the "transition path for removing lazy init of thread pointer" project that began during the 1.1.0 release cycle.
* allow i386 __set_thread_area to be called more than once (Rich Felker, 2015-04-13, 1 file, -1/+5)
  previously a new GDT slot was requested, even if one had already been obtained by a previous call. instead extract the old slot number from GS and reuse it if it was already set. the formula (GS-3)/8 for the slot number automatically yields -1 (request for new slot) if GS is zero (unset).
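A worked version of the (GS-3)/8 formula, written in C for illustration. The commit message does not show the instruction sequence used, so this sketch assumes the division rounds toward negative infinity, as an arithmetic right shift would. Plain C integer division truncates toward zero and would give 0, not -1, for GS equal to zero. The function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: recovers the GDT slot index from a user-mode
 * segment selector of the form index*8 + 3, using floor division.
 * For gs == 0 this yields -1, the "allocate a new slot" request. */
int toy_gdt_slot_from_gs(unsigned gs)
{
    int n = (int)gs - 3;

    assert(gs <= 0xffff); /* segment selectors are 16 bits wide */
    /* Floor division by 8; C's / operator truncates toward zero. */
    return n >= 0 ? n / 8 : -((-n + 7) / 8);
}
```

For example, a selector of 0x33 gives (51-3)/8 = 6, while an unset selector of 0 gives floor(-3/8) = -1.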
* remove mismatched arguments from vmlock function definitions (Rich Felker, 2015-04-11, 1 file, -2/+2)
  commit f08ab9e61a147630497198fe3239149275c0a3f4 introduced these accidentally as remnants of some work I tried that did not work out.
* apply vmlock wait to __unmapself in pthread_exit (Rich Felker, 2015-04-10, 1 file, -0/+4)
* redesign and simplify vmlock system (Rich Felker, 2015-04-10, 5 files, -30/+18)
  this global lock allows certain unlock-type primitives to exclude mmap/munmap operations which could change the identity of virtual addresses while references to them still exist.
  the original design mistakenly assumed mmap/munmap would conversely need to exclude the same operations which exclude mmap/munmap, so the vmlock was implemented as a sort of 'symmetric recursive rwlock'. this turned out to be unnecessary.
  commit 25d12fc0fc51f1fae0f85b4649a6463eb805aa8f already shortened the interval during which mmap/munmap held their side of the lock, but left the inappropriate lock design and some inefficiency.
  the new design uses a separate function, __vm_wait, which does not hold any lock itself and only waits for lock users which were already present when it was called to release the lock. this is sufficient because of the way operations that need to be excluded are sequenced: the "unlock-type" operations using the vmlock need only block mmap/munmap operations that are precipitated by (and thus sequenced after) the atomic-unlock they perform while holding the vmlock.
  this allows for a spectacular lack of synchronization in the __vm_wait function itself.
* optimize out setting up robust list with kernel when not needed (Rich Felker, 2015-04-10, 2 files, -6/+5)
  as a result of commit 12e1e324683a1d381b7f15dd36c99b37dd44d940, kernel processing of the robust list is only needed for process-shared mutexes. previously the first attempt to lock any owner-tracked mutex resulted in robust list initialization and a set_robust_list syscall. this is no longer necessary, and since the kernel's record of the robust list must now be cleared at thread exit time for detached threads, optimizing it out is more worthwhile than before too.
* process robust list in pthread_exit to fix detached thread use-after-unmap (Rich Felker, 2015-04-10, 2 files, -26/+27)
  the robust list head lies in the thread structure, which is unmapped before exit for detached threads. this leaves the kernel unable to process the exiting thread's robust list, and with a dangling pointer which may happen to point to new unrelated data at the time the kernel processes it.
  userspace processing of the robust list was already needed for non-pshared robust mutexes in order to perform private futex wakes rather than the shared ones the kernel would do, but it was conditional on linking pthread_mutexattr_setrobust and did not bother processing the pshared mutexes in the list, which requires additional logic for the robust list pending slot in case pthread_exit is interrupted by asynchronous process termination.
  the new robust list processing code is linked unconditionally (inlined in pthread_exit), handles both private and shared mutexes, and also removes the kernel's reference to the robust list before unmapping and exit if the exiting thread is detached.
* block all signals (even internal ones) in cancellation signal handler (Rich Felker, 2015-03-16, 1 file, -1/+2)
  previously the implementation-internal signal used for multithreaded set*id operations was left unblocked during handling of the cancellation signal. however, on some archs, signal contexts are huge (up to 5k) and the possibility of nested signal handlers drastically increases the minimum stack requirement. since the cancellation signal handler will do its job and return in bounded time before possibly passing execution to application code, there is no need to allow other signals to interrupt it.
* add aarch64 port (Szabolcs Nagy, 2015-03-11, 4 files, -0/+69)
  This adds complete aarch64 target support including bigendian subarch.
  Some of the long double math functions are known to be broken; otherwise interfaces should be fully functional, but at this point consider this port experimental.
  Initial work on this port was done by Sireesh Tripurari and Kevin Bortis.
* fix regression in pthread_cond_wait with cancellation disabled (Rich Felker, 2015-03-07, 1 file, -0/+1)
  due to a logic error in the use of masked cancellation mode, pthread_cond_wait did not honor PTHREAD_CANCEL_DISABLE but instead failed with ECANCELED when cancellation was pending.
* fix signed left-shift overflow in pthread_condattr_setpshared (Rich Felker, 2015-03-04, 1 file, -1/+1)
* make all objects used with atomic operations volatile (Rich Felker, 2015-03-03, 9 files, -16/+18)
  the memory model we use internally for atomics permits plain loads of values which may be subject to concurrent modification without requiring that a special load function be used. since a compiler is free to make transformations that alter the number of loads or the way in which loads are performed, the compiler is theoretically free to break this usage. the most obvious concern is with atomic cas constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be transformed to a_cas(p,*p,f(*p)); where the latter is intended to show multiple loads of *p whose resulting values might fail to be equal; this would break the atomicity of the whole operation. but even more fundamental breakage is possible.
  with the changes being made now, objects that may be modified by atomics are modeled as volatile, and the atomic operations performed on them by other threads are modeled as asynchronous stores by hardware which happens to be acting on the request of another thread. such modeling of course does not itself address memory synchronization between cores/cpus, but that aspect was already handled. this all seems less than ideal, but it's the best we can do without mandating a C11 compiler and using the C11 model for atomics.
  in the case of pthread_once_t, the ABI type of the underlying object is not volatile-qualified. so we are assuming that accessing the object through a volatile-qualified lvalue via casts yields volatile access semantics. the language of the C standard is somewhat unclear on this matter, but this is an assumption the linux kernel also makes, and seems to be the correct interpretation of the standard.
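The tmp=*p;a_cas(p,tmp,f(tmp)); pattern from the message can be sketched as a retry loop. This is an illustration under assumptions: `toy_cas` is a hypothetical stand-in for an a_cas-style primitive, built here on the GCC/Clang `__sync_val_compare_and_swap` builtin rather than on musl's per-arch code, and `toy_atomic_apply` and `toy_double` are invented names.

```c
#include <assert.h>

/* Hypothetical stand-in for an a_cas-style primitive: stores n into *p
 * if *p equals t, and returns the value that was seen in *p. */
int toy_cas(volatile int *p, int t, int n)
{
    return __sync_val_compare_and_swap(p, t, n);
}

/* Replaces *p with f(*p) atomically and returns the previous value. */
int toy_atomic_apply(volatile int *p, int (*f)(int))
{
    int old;

    assert(p != 0 && f != 0);
    do {
        /* One volatile load per iteration; the same snapshot is used
         * both as the expected value and as the input to f. */
        old = *p;
    } while (toy_cas(p, old, f(old)) != old);
    return old;
}

/* Example update function for use with toy_atomic_apply. */
int toy_double(int x)
{
    return 2 * x;
}
```

With the volatile qualifier on the pointee, the compiler may not replace the single load of `*p` with two separate loads, which is the transformation the message is concerned about.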
* suppress masked cancellation in pthread_join (Rich Felker, 2015-03-02, 1 file, -1/+5)
  like close, pthread_join is a resource-deallocation function which is also a cancellation point. the intent of masked cancellation mode is to exempt such functions from failure with ECANCELED.
* fix namespace issue in pthread_join affecting thrd_join (Rich Felker, 2015-03-02, 1 file, -1/+2)
  pthread_testcancel is not in the ISO C reserved namespace and thus cannot be used here. use the namespace-protected version of the function instead.
* factor cancellation cleanup push/pop out of futex __timedwait function (Rich Felker, 2015-03-02, 7 files, -24/+21)
  previously, the __timedwait function was optionally a cancellation point depending on whether it was passed a pointer to a cleanup function and context to register. as of now, only one caller actually used such a cleanup function (and it may face removal soon); most callers either passed a null pointer to disable cancellation or a dummy cleanup function.
  now, __timedwait is never a cancellation point, and __timedwait_cp is the cancellable version. this makes the intent of the calling code more obvious and avoids ugly dummy functions and long argument lists.
* fix failure of internal futex __timedwait to report ECANCELED (Rich Felker, 2015-02-27, 1 file, -1/+1)
  as part of abstracting the futex wait, this function suppresses all futex error values which callers should not see using a whitelist approach. when the masked cancellation mode was added, the new ECANCELED error was not whitelisted. this omission caused the new pthread_cond_wait code using masked cancellation to exhibit a spurious wake (rather than acting on cancellation) when the request arrived after blocking on the cond var.
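The whitelist approach described above might look roughly like the following. Only ECANCELED is named in the commit message; treating EINTR and ETIMEDOUT as the other whitelisted values is an assumption made for this sketch, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical filter: passes through the errors a caller is meant to
 * see and maps every other wait result to 0 (an ordinary wake).
 * The EINTR and ETIMEDOUT entries are assumptions for illustration. */
int toy_filter_wait_error(int r)
{
    assert(r >= 0); /* expects a positive errno value or 0 */
    if (r != EINTR && r != ETIMEDOUT && r != ECANCELED)
        r = 0;
    return r;
}
```

Leaving ECANCELED out of the condition would turn a cancellation result into 0, which a caller cannot tell apart from a spurious wake. That is the failure mode the message describes.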
* fix breakage in pthread_cond_wait due to typo (Rich Felker, 2015-02-23, 1 file, -1/+1)
  due to accidental use of = instead of ==, the error code was always set to zero in the signaled wake case for non-shared cv waits. suppressing ETIMEDOUT (the only possible wait error) is harmless and actually permitted in this case, but suppressing mutex errors could give the caller false information about the state of the mutex.
  commit 8741ffe625363a553e8f509dc3ca7b071bdbab47 introduced this regression and commit d9da1fb8c592469431c764732d09f7756340190e preserved it when reorganizing the code.
* simplify cond var code now that cleanup handler is not needed (Rich Felker, 2015-02-22, 1 file, -86/+63)