path: root/sysdeps/aarch64
Commit message | Author | Age | Files | Lines

* Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN | Florian Weimer | 2021-12-09 | 1 | -6/+0
  TLS_INIT_TCB_ALIGN is not actually used. TLS_TCB_ALIGN was likely introduced to support a configuration where the thread pointer does not have the same alignment as THREAD_SELF. Only ia64 seems to use that, but for the stack/pointer guard, not for storing tcbhead_t. Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift the thread pointer, potentially landing in a different residue class modulo the alignment, but the changes should not impact that.

  In general, given that TLS variables have their own alignment requirements, having different alignment for the (unshifted) thread pointer and struct pthread would potentially result in dynamic offsets, leading to more complexity.

  hppa had different values before: __alignof__ (tcbhead_t), which seems to be 4, and __alignof__ (struct pthread), which was 8 (old default) and is now 32. However, it defines THREAD_SELF as:

    /* Return the thread descriptor for the current thread.  */
    # define THREAD_SELF \
      ({ struct pthread *__self; \
         __self = __get_cr27(); \
         __self - 1; \
       })

  So the thread pointer points after struct pthread (hence __self - 1), and they have to have the same alignment on hppa as well.

  Similarly, on ia64, the definitions were different. We have:

    # define TLS_PRE_TCB_SIZE \
      (sizeof (struct pthread) \
       + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t) \
          ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1) \
             & ~(__alignof__ (struct pthread) - 1)) \
          : 0))
    # define THREAD_SELF \
      ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE))

  And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment (confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c).

  On m68k, we have a larger gap between tcbhead_t and struct pthread. But as far as I can tell, the port is fine with that. The definition of TCB_OFFSET is sufficient to handle the shifted TCB scenario.

  This fixes commit 23c77f60181eb549f11ec2f913b4270af29eee38 ("nptl: Increase default TCB alignment to 32").

  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

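  The invariant discussed above can be written as a compile-time check. A minimal sketch, not glibc code; the *_stub names stand in for struct pthread and TLS_PRE_TCB_SIZE:

    /* If the thread pointer is shifted by a size that is a multiple of the
       descriptor alignment, the shifted pointer stays in the same residue
       class, so a single alignment value suffices.  */
    struct pthread_stub { long data[4]; } __attribute__ ((aligned (32)));

    enum { tls_pre_tcb_size_stub = (sizeof (struct pthread_stub) + 31) & ~31UL };

    _Static_assert (tls_pre_tcb_size_stub % _Alignof (struct pthread_stub) == 0,
                    "shifted thread pointer keeps struct pthread alignment");
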
* nptl: Introduce <tcb-access.h> for THREAD_* accessors | Florian Weimer | 2021-12-09 | 1 | -9/+1
  These are common between most architectures. Only the x86 targets are outliers.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

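  A hedged sketch of what the architecture-neutral accessors amount to: plain member access on the thread descriptor (the x86 targets instead read members relative to the thread register, which is why they stay separate). The real <tcb-access.h> may differ in detail:

    /* Generic TCB accessors: read or write a member of the descriptor directly.  */
    #define THREAD_GETMEM(descr, member) ((descr)->member)
    #define THREAD_SETMEM(descr, member, value) ((descr)->member = (value))
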
* nptl: Increase default TCB alignment to 32 | Florian Weimer | 2021-12-03 | 1 | -3/+0
  rseq support will use a 32-byte aligned field in struct pthread, so the whole struct needs to have at least that alignment.

  nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h> to obtain the fallback definition.

  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

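  A small illustration (stand-in names, not glibc code) of why one aligned member forces the whole descriptor to 32-byte alignment:

    struct thread_descr_stub
    {
      long other_state;
      _Alignas (32) char rseq_area[32];   /* stand-in for the rseq member */
    };

    _Static_assert (_Alignof (struct thread_descr_stub) >= 32,
                    "the descriptor inherits the member's 32-byte alignment");
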
* AArch64: Improve A64FX memcpy | Wilco Dijkstra | 2021-12-02 | 1 | -321/+225
  v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes.

  Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

* AArch64: Optimize memcmp | Wilco Dijkstra | 2021-12-02 | 1 | -107/+134
  Rewrite memcmp to improve performance. On small and medium inputs performance is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per iteration, which is 30-50% faster depending on the size.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

* String: Add hidden defs for __memcmpeq() to enable internal usage | Noah Goldstein | 2021-10-26 | 1 | -0/+1
  No bug. This commit adds hidden defs for all declarations of __memcmpeq. This enables internal glibc callers to use __memcmpeq without going through the PLT.

* String: Add support for __memcmpeq() ABI on all targets | Noah Goldstein | 2021-10-26 | 1 | -0/+2
  No bug. This commit adds support for __memcmpeq() as a new ABI for all targets. In this commit __memcmpeq() is implemented only as an alias to the corresponding target's memcmp() implementation. __memcmpeq() is added as a new symbol starting with GLIBC_2.35 and defined in string.h with comments explaining its behavior. Basic tests that it is callable and works were added in string/tester.c.

  As discussed in the proposal "Add new ABI '__memcmpeq()' to libc", __memcmpeq() is essentially a reserved namespace for bcmp(). This means it shares the same specification as memcmp(), except that the return value for non-equal byte sequences is any non-zero value. This is less strict than memcmp()'s return value specification and can be better optimized when only a boolean return is needed.

  __memcmpeq() is meant to only be called by compilers if they can prove that the return value of a memcmp() call is only used for its boolean value.

  All tests in string/tester.c passed. The build also succeeds on the x86_64-linux-gnu target.

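  A hedged example of the intended division of labour: applications keep calling memcmp, and a compiler that can prove only the boolean result is used may lower the call to __memcmpeq, whose return value is only meaningful as zero/non-zero. The declaration is repeated here so the sketch is self-contained:

    #include <stddef.h>

    extern int __memcmpeq (const void *, const void *, size_t);

    int
    keys_equal (const void *a, const void *b, size_t n)
    {
      /* Written by the programmer as: return memcmp (a, b, n) == 0;
         after the transformation the call site effectively becomes:  */
      return __memcmpeq (a, b, n) == 0;
    }
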
* elf: Fix dynamic-link.h usage on rtld.c | Adhemerval Zanella | 2021-10-14 | 1 | -5/+3
  The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on rtld.c due to two issues:

  1. RTLD_BOOTSTRAP is also used on dl-machine.h on various architectures and it changes the semantics of various machine relocation functions.

  2. The elf_get_dynamic_info() change was done sideways: previously to 490e6c62aa, get-dynamic-info.h was included by the first dynamic-link.h include *without* RTLD_BOOTSTRAP being defined. It means that the code within elf_get_dynamic_info() that uses RTLD_BOOTSTRAP is in fact unused.

  To fix 1., this patch now includes dynamic-link.h only once with RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have the relocation functions with the expected semantics for the loader.

  And to fix 2., part of 4af6982e4c is reverted (the check argument to elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP pieces are removed.

  To reorganize the includes, the static TLS definition is moved to its own header to avoid a circular dependency (it is defined on dynamic-link.h and dl-machine.h requires it, while at the same time another dynamic-link.h definition requires dl-machine.h definitions).

  Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL are moved to their own header. Only ancient ABIs need special values (arm, i386, and mips), so a generic one is used as default.

  The powerpc Elf64_FuncDesc is also moved to its own header, since csu code requires its definition (which would otherwise require either including the elf/ folder or adding a full path with elf/).

  Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32, and powerpc64le.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

* elf: Avoid nested functions in the loader [BZ #27220] | Fangrui Song | 2021-10-07 | 1 | -9/+13
  dynamic-link.h is included more than once in some elf/ files (rtld.c, dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested functions. This harms readability, and the nested function usage is the biggest obstacle preventing a Clang build (Clang doesn't support GCC nested functions).

  The key idea for unnesting is to add extra parameters (struct link_map * and struct r_scope_elem *[]) to RESOLVE_MAP, ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a], elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired by Stan Shebs' ppc64/x86-64 implementation in google/grte/v5-2.27/master, which uses mixed extra parameters and static variables.)

  Future simplification:
  * If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM, elf_machine_runtime_setup can drop the `scope` parameter.
  * If TLSDESC no longer needs to be in elf_machine_lazy_rel, elf_machine_lazy_rel can drop the `scope` parameter.

  Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32, sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64. In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu and riscv64-linux-gnu-rv64imafdc-lp64d.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

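  A minimal sketch of the unnesting pattern described above, with made-up helper and parameter names rather than the actual glibc code:

    struct link_map;
    struct r_scope_elem;

    /* Before: a GCC nested function inside the relocation code captured the
       link map and scope from the enclosing frame, which Clang cannot build.
       After: the helper receives them as explicit parameters.  */
    static struct link_map *
    resolve_map_sketch (struct link_map *map, struct r_scope_elem *scope[],
                        unsigned long symndx)
    {
      /* ... look up symbol `symndx` using the passed-in map and scope ... */
      (void) scope;
      (void) symndx;
      return map;
    }
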
* aarch64: update libm test ulps | Szabolcs Nagy | 2021-10-05 | 1 | -1/+1
  Update after commit 6bbf7298323bf31bc43494b2201465a449778e10, "Fixed inaccuracy of j0f (BZ #28185)".

* aarch64: Disable A64FX memcpy/memmove BTI unconditionally | Naohiro Tamura | 2021-09-24 | 1 | -0/+3
  This patch disables A64FX memcpy/memmove BTI instruction insertion unconditionally, as was done for the A64FX memset patch [1], for performance.

  [1] commit 07b427296b8d59f439144029d9a948f6c1ce0a31

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

* elf: Remove THREAD_GSCOPE_IN_TCB | Sergey Bugaev | 2021-09-16 | 1 | -1/+0
  All the ports now have THREAD_GSCOPE_IN_TCB set to 1. Remove all support for !THREAD_GSCOPE_IN_TCB, along with the definition itself.

  Signed-off-by: Sergey Bugaev <bugaevc@gmail.com>
  Message-Id: <20210915171110.226187-4-bugaevc@gmail.com>
  Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

* AArch64: Update A64FX memset not to degrade at 16KB | Naohiro Tamura | 2021-09-06 | 1 | -1/+8
  This patch updates the unroll8 code so that it does not degrade at the 16KB performance peak for both FX1000 and FX700. The two instructions inserted at the beginning of the unroll8 loop, a cmp and a branch, are a workaround found heuristically.

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

* Revert "AArch64: Update A64FX memset not to degrade at 16KB"Szabolcs Nagy2021-09-061-8/+1
| | | | | | Because of wrong commit author. Will recommit it with right author. This reverts commit 23777232c23f80809613bdfa329f63aadf992922.
* Remove "Contributed by" linesSiddhesh Poyarekar2021-09-031-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | We stopped adding "Contributed by" or similar lines in sources in 2012 in favour of git logs and keeping the Contributors section of the glibc manual up to date. Removing these lines makes the license header a bit more consistent across files and also removes the possibility of error in attribution when license blocks or files are copied across since the contributed-by lines don't actually reflect reality in those cases. Move all "Contributed by" and similar lines (Written by, Test by, etc.) into a new file CONTRIBUTED-BY to retain record of these contributions. These contributors are also mentioned in manual/contrib.texi, so we just maintain this additional record as a courtesy to the earlier developers. The following scripts were used to filter a list of files to edit in place and to clean up the CONTRIBUTED-BY file respectively. These were not added to the glibc sources because they're not expected to be of any use in future given that this is a one time task: https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02 Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* AArch64: Update A64FX memset not to degrade at 16KB | Naohiro Tamura via Libc-alpha | 2021-09-03 | 1 | -1/+8
  This patch updates the unroll8 code so that it does not degrade at the 16KB performance peak for both FX1000 and FX700. The two instructions inserted at the beginning of the unroll8 loop, a cmp and a branch, are a workaround found heuristically.

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

* Remove sysdeps/*/tls-macros.h | Fangrui Song | 2021-08-18 | 1 | -51/+0
  They provide TLS_GD/TLS_LD/TLS_IE/TLS_LE macros for TLS testing. Now that we have migrated to __thread and tls_model attributes, these macros are unused and the tls-macros.h files can retire.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

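  What replaced those assembler macros in the TLS tests, per the commit text: plain __thread variables with explicit tls_model attributes, for example:

    __thread int tls_gd_var __attribute__ ((tls_model ("global-dynamic")));
    __thread int tls_ld_var __attribute__ ((tls_model ("local-dynamic")));
    __thread int tls_ie_var __attribute__ ((tls_model ("initial-exec")));
    __thread int tls_le_var __attribute__ ((tls_model ("local-exec")));
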
* aarch64: Make elf_machine_{load_address,dynamic} robust [BZ #28203] | Fangrui Song | 2021-08-11 | 1 | -15/+9
  The AArch64 ABI is largely platform agnostic and does not specify _GLOBAL_OFFSET_TABLE_[0] ([1]). glibc ld.so turns out to be probably the only user of _GLOBAL_OFFSET_TABLE_[0], and GNU ld defines its value to the link-time address of _DYNAMIC. [2]

  In 2012, __ehdr_start was implemented in GNU ld and gold in binutils 2.23. Using adrp+add / (-mcmodel=tiny) adr to access __ehdr_start/_DYNAMIC gives us a robust way to get the load address and the link-time address of _DYNAMIC.

  [1]: From a psABI maintainer, https://bugs.llvm.org/show_bug.cgi?id=49672#c2
  [2]: LLD's aarch64 port does not set _GLOBAL_OFFSET_TABLE_[0] to the link-time address of _DYNAMIC. LLD is widely used on aarch64 Android and ChromeOS devices. Software just works without the need for _GLOBAL_OFFSET_TABLE_[0].

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

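  A hedged C-level sketch of the approach (close to, but not necessarily identical to, the committed code): both symbols are provided by the linker, and the compiler reaches them with pc-relative adrp+add sequences.

    #include <link.h>

    extern const ElfW(Ehdr) __ehdr_start;
    extern ElfW(Dyn) _DYNAMIC[];

    static inline ElfW(Addr)
    load_address (void)
    {
      /* The ELF header sits at link-time address 0 of ld.so, so the runtime
         address of __ehdr_start is the load (base) address.  */
      return (ElfW(Addr)) &__ehdr_start;
    }

    static inline ElfW(Addr)
    dynamic_link_address (void)
    {
      /* Runtime address of _DYNAMIC minus the load address gives its
         link-time address.  */
      return (ElfW(Addr)) _DYNAMIC - load_address ();
    }
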
* [5/5] AArch64: Improve A64FX memset medium loops | Wilco Dijkstra | 2021-08-10 | 1 | -26/+19
  Simplify the code for memsets smaller than L1. Improve the unroll8 and L1_prefetch loops.

  Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

* [4/5] AArch64: Improve A64FX memset by removing unroll32 | Wilco Dijkstra | 2021-08-10 | 1 | -17/+1
  Remove unroll32 code since it doesn't improve performance.

  Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

* [3/5] AArch64: Improve A64FX memset for remaining bytes | Wilco Dijkstra | 2021-08-10 | 1 | -33/+13
  Simplify handling of remaining bytes. Avoid lots of taken branches and complex whilelo computations; instead unconditionally write vectors from the end.

  Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

* [2/5] AArch64: Improve A64FX memset for large sizes | Wilco Dijkstra | 2021-08-10 | 1 | -60/+25
  Improve performance of large memsets. Simplify alignment code. For zero memset use DC ZVA, which almost doubles performance. For non-zero memsets use the unroll8 loop which is about 10% faster.

  Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

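  A hedged illustration of the DC ZVA idea (not the A64FX assembly itself): each instruction zeroes one block of the size advertised in dczid_el0, commonly 64 bytes, and the destination must be aligned to that block size. The function name and parameters below are made up for the example.

    #include <stddef.h>

    static inline void
    zero_blocks (void *dst, size_t blocks, size_t block_size)
    {
      char *p = dst;
      /* Each dc zva zeroes one aligned block starting at p.  */
      for (size_t i = 0; i < blocks; i++, p += block_size)
        __asm__ volatile ("dc zva, %0" : : "r" (p) : "memory");
    }
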
* [1/5] AArch64: Improve A64FX memset for small sizes | Wilco Dijkstra | 2021-08-10 | 1 | -60/+36
  Improve performance of small memsets by reducing instruction counts and improving code alignment. Bench-memset shows a 35-45% performance gain for small sizes.

  Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>

* glibc.malloc.check: Wean away from malloc hooks | Siddhesh Poyarekar | 2021-07-22 | 1 | -0/+3
  The malloc-check debugging feature is tightly integrated into glibc malloc, so thanks to an idea from Florian Weimer, much of the malloc implementation has been moved into libc_malloc_debug.so to support malloc-check. Due to this, glibc malloc and malloc-check can no longer work together; they use altogether different (but identical) structures for heap management. This should not make a difference though since the malloc check hook is not disabled anywhere. malloc_set_state does, but it does so early enough that it shouldn't cause any problems.

  The malloc check tunable is now in the debug DSO and has no effect when the DSO is not preloaded.

  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  Tested-by: Carlos O'Donell <carlos@redhat.com>

* AArch64: Add hp-timing.h | Wilco Dijkstra | 2021-07-01 | 1 | -0/+47
  Add hp-timing.h using the cntvct_el0 counter. Return timing in nanoseconds so it is fully compatible with generic hp-timing. Don't set HP_TIMING_INLINE in the dynamic linker since it adds unnecessary overheads and some ancient kernels may not handle emulating cntvct_el0 correctly. Currently cntvct_el0 is only used for timing in the benchtests.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

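  A hedged sketch of the counter access (the real hp-timing.h may scale and order the reads differently): cntvct_el0 gives ticks, cntfrq_el0 the tick frequency, and nanoseconds follow from ticks * 1e9 / freq.

    #include <stdint.h>

    static inline uint64_t
    timer_ns (void)
    {
      uint64_t ticks, freq;
      /* The isb keeps the counter read from being hoisted too early.  */
      __asm__ volatile ("isb\n\tmrs %0, cntvct_el0" : "=r" (ticks));
      __asm__ ("mrs %0, cntfrq_el0" : "=r" (freq));
      /* Widen to 128 bits so the multiplication cannot overflow.  */
      return (uint64_t) ((__uint128_t) ticks * 1000000000u / freq);
    }
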
* AArch64: Improve strnlen performance | Wilco Dijkstra | 2021-07-01 | 1 | -181/+89
  Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1 large strings are 1.8x faster than the current version, and bench-strnlen is 50% faster overall. This version is MTE compatible.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

* Update math: redirect roundeven function | H.J. Lu | 2021-06-27 | 2 | -1/+2
  Redirect target specific roundeven functions for aarch64, ldbl-128ibm and riscv.

* AArch64: Add support for roundeven[f] | Wilco Dijkstra | 2021-06-08 | 2 | -0/+57
  Add inline assembler for the roundeven functions. Passes the glibc regression tests. Note GCC does not inline the builtin (PR100966), so this cannot be used for now.

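  The instruction in question is FRINTN (round to nearest, ties to even), which matches roundeven's semantics exactly. A hedged sketch of the inline-assembly form for the double variant; the committed code may be arranged differently:

    static inline double
    roundeven_sketch (double x)
    {
      double r;
      /* frintn: round to nearest integral value, ties to even.  */
      __asm__ ("frintn %d0, %d1" : "=w" (r) : "w" (x));
      return r;
    }
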
* aarch64: Added optimized memset for A64FX | Naohiro Tamura | 2021-05-27 | 4 | -5/+286
  This patch optimizes the performance of memset for A64FX [1], which implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node. The performance optimization makes use of the Scalable Vector Registers with several techniques such as loop unrolling, memory access alignment, cache zero fill and prefetch.

  SVE assembler code for memset is implemented as Vector Length Agnostic code, so theoretically it can be run on any SoC which supports the ARMv8-A SVE standard.

  We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. We also confirmed, by running 'make bench', that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit registers and 8 times better than scalar 64 bit registers.

  [1] https://github.com/fujitsu/A64FX

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
  Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>

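  For orientation, a hedged vector-length-agnostic memset core using ACLE SVE intrinsics (compile with -march=armv8.2-a+sve); the committed A64FX code is hand-written assembly that adds unrolling, alignment handling, cache zero fill and prefetch on top of this basic shape. The function name is made up for the example.

    #include <arm_sve.h>
    #include <stdint.h>
    #include <stddef.h>

    void
    memset_vla (uint8_t *dst, uint8_t c, size_t n)
    {
      svuint8_t v = svdup_n_u8 (c);
      for (uint64_t i = 0; i < n; i += svcntb ())
        {
          svbool_t pg = svwhilelt_b8_u64 (i, n);   /* lanes i .. n-1 active */
          svst1_u8 (pg, dst + i, v);
        }
    }
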
* aarch64: Added optimized memcpy and memmove for A64FX | Naohiro Tamura | 2021-05-27 | 6 | -13/+443
  This patch optimizes the performance of memcpy/memmove for A64FX [1], which implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node. The performance optimization makes use of the Scalable Vector Registers with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining.

  SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code, so theoretically it can be run on any SoC which supports the ARMv8-A SVE standard.

  We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. We also confirmed, by running 'make bench', that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit registers and 8 times better than scalar 64 bit registers.

  [1] https://github.com/fujitsu/A64FX

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
  Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>

* aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI | Naohiro Tamura | 2021-05-26 | 1 | -2/+7
  This patch defines the BTI_C and BTI_J macros conditionally for performance. If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT instructions for ARMv8.5 BTI (Branch Target Identification). If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as NOP.

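  A hedged sketch of the conditional definitions described above (assembler text generated through the C preprocessor; the exact glibc spelling may differ). BTI C and BTI J are the HINT #34 and HINT #36 encodings, which execute as NOPs on cores without BTI:

    #if HAVE_AARCH64_BTI
    # define BTI_C  hint 34   /* BTI c: valid landing pad for calls */
    # define BTI_J  hint 36   /* BTI j: valid landing pad for jumps */
    #else
    # define BTI_C  nop
    # define BTI_J  nop
    #endif
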
* config: Added HAVE_AARCH64_SVE_ASM for aarch64 | Naohiro Tamura | 2021-05-26 | 2 | -0/+43
  This patch checks whether the assembler supports '-march=armv8.2-a+sve' to generate SVE code, and then defines the HAVE_AARCH64_SVE_ASM macro accordingly.

* elf: Remove lazy tlsdesc relocation related code | Szabolcs Nagy | 2021-04-21 | 1 | -1/+0
  Remove generic tlsdesc code related to lazy tlsdesc processing since lazy tlsdesc relocation is no longer supported. This includes removing GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at load time when that lock is already held.

  Added a documentation comment too.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

* aarch64: update libm test ulps | Szabolcs Nagy | 2021-04-08 | 1 | -1/+1
  Update after commit 43576de04afc6a0896a3ecc094e1581069a0652a.

* aarch64: free tlsdesc data on dlclose [BZ #27403] | Szabolcs Nagy | 2021-04-06 | 1 | -0/+27
  DL_UNMAP_IS_SPECIAL and DL_UNMAP were not defined. The definitions are now copied from arm, since the same is needed on aarch64. The cleanup of tlsdesc data is handled by the custom _dl_unmap.

  Fixes bug 27403.

* Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472] | Paul Zimmermann | 2021-04-02 | 1 | -35/+35
  For j0f/j1f/y0f/y1f, the largest error for all binary32 inputs is reduced to at most 9 ulps for all rounding modes. The new code is enabled only when there is a cancellation at the very end of the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not give any visible slowdown on average.

  Two different algorithms are used:
  * around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)
  * for large inputs, an asymptotic formula from [1] is used

  [1] Fast and Accurate Bessel Function Computation, John Harrison, Proceedings of Arith 19, 2009.

  Inputs yielding the new largest errors are added to auto-libm-test-in, and ulps are regenerated for various targets (thanks Adhemerval Zanella).

  Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

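  For context, the leading term of the classical large-argument asymptotics that formulas of the kind in [1] refine (the implementation uses higher-order correction terms; this is only the first term):

    J_0(x) \approx \sqrt{2/(\pi x)} \, \cos\!\left(x - \tfrac{\pi}{4}\right), \qquad
    Y_0(x) \approx \sqrt{2/(\pi x)} \, \sin\!\left(x - \tfrac{\pi}{4}\right), \qquad x \to \infty.
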
* aarch64: Optimize __libc_mtag_tag_zero_region | Szabolcs Nagy | 2021-03-26 | 1 | -16/+80
  This is a target hook for memory tagging; the original was a naive implementation. Uses the same algorithm as __libc_mtag_tag_region, but with instructions that also zero the memory.

  This was not benchmarked on a real cpu, but it is expected to be faster than the naive implementation.

* aarch64: Optimize __libc_mtag_tag_region | Szabolcs Nagy | 2021-03-26 | 1 | -18/+80
  This is a target hook for memory tagging; the original was a naive implementation. The optimized version relies on "dc gva" to tag 64 bytes at a time for large allocations and optimizes small cases without adding too many branches.

  This was not benchmarked on a real cpu, but it is expected to be faster than the naive implementation.

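  A hedged illustration of the bulk path only (not the glibc routine): "dc gva" sets the MTE allocation tag, taken from the tag bits of the pointer, for one block at a time; per the commit, the committed version additionally handles small sizes and the leftover granules separately. The helper name and parameters are made up.

    #include <stddef.h>

    static inline void
    tag_blocks (void *tagged_ptr, size_t blocks, size_t block_size)
    {
      char *p = tagged_ptr;
      /* Each dc gva tags one aligned block with the pointer's tag.  */
      for (size_t i = 0; i < blocks; i++, p += block_size)
        __asm__ volatile ("dc gva, %0" : : "r" (p) : "memory");
    }
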
* aarch64: inline __libc_mtag_new_tag | Szabolcs Nagy | 2021-03-26 | 3 | -41/+11
  This is a common operation when heap tagging is enabled, so inline the instructions instead of using an extern call.

* aarch64: inline __libc_mtag_address_get_tag | Szabolcs Nagy | 2021-03-26 | 3 | -39/+10
  This is a common operation when heap tagging is enabled, so inline the instruction instead of using an extern call.

  The .inst directive is used instead of the name of the instruction (or acle intrinsics) because malloc.c is not compiled for the armv8.5-a+memtag architecture; runtime cpu support detection is used.

  Prototypes are removed from the comments as they were not always correct.

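  A hedged sketch of the operation being inlined: LDG reads the allocation tag for an address and inserts it into the tag bits of the register. Writing the mnemonic, as below, requires building for armv8.5-a+memtag, which is why glibc emits the fixed encoding with .inst instead; this sketch is illustrative, not the glibc code.

    static inline void *
    address_get_tag_sketch (void *p)
    {
      /* ldg Xt, [Xn]: load the allocation tag for the address in Xn into
         bits 56-59 of Xt; here Xt and Xn are the same register.  */
      __asm__ ("ldg %0, [%0]" : "+r" (p) : : "memory");
      return p;
    }
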
* malloc: Only support zeroing and not arbitrary memset with mtag | Szabolcs Nagy | 2021-03-26 | 3 | -14/+10
  The memset api is suboptimal and does not provide much benefit. Memory tagging only needs a zeroing memset (and only for memory that's sized and aligned to multiples of the tag granule), so change the internal api and the target hooks accordingly. This is to simplify the implementation of the target hook.

  Reviewed-by: DJ Delorie <dj@redhat.com>

* math: Remove slow paths from asin and acos [BZ #15267] | Wilco Dijkstra | 2021-03-11 | 1 | -1/+1
  This patch series removes all remaining slow paths and related code. First asin/acos, tan, atan, atan2 implementations are updated, and the final patch removes the unused mpa files, headers and probes. Passes buildmanyglibc.

  Remove slow paths from asin/acos. Add ULP annotations based on previous slow path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps.

  Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>

* aarch64: update ulps. | Szabolcs Nagy | 2021-03-01 | 1 | -15/+17
  For new test cases in commit 5a051454a9b50c27984bbc499ee1297de48e2dc8.

* Reduce the statically linked startup code [BZ #23323] | Florian Weimer | 2021-02-25 | 1 | -12/+2
  It turns out the startup code in csu/elf-init.c has a perfect pair of ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A New Method to Bypass 64-bit Linux ASLR"). These functions are not needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY are already processed by the dynamic linker. However, the dynamic linker skipped the main program for some reason. For maximum backwards compatibility, this is not changed, and instead, the main map is consulted from __libc_start_main if the init function argument is a NULL pointer.

  For statically linked binaries, the old approach based on linker symbols is still used because there is nothing else available.

  A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because new binaries running on an old libc would not run their ELF constructors, leading to difficult-to-debug issues.

* aarch64: Fix the list of tested IFUNC variants [BZ #26818] | Szabolcs Nagy | 2021-01-25 | 2 | -4/+6
  Some IFUNC variants are not compatible with BTI and MTE, so don't set them as usable for testing and benchmarking on a BTI or MTE enabled system.

  As far as IFUNC selectors are concerned, a system is BTI enabled if the cpu supports it and glibc was built with BTI branch protection.

  Most IFUNC variants are BTI compatible, but thunderx2 memcpy and memmove use a jump table with an indirect jump, without a BTI j.

  Fixes bug 26818.

* aarch64: Move and update the definition of MTE_ENABLED | Szabolcs Nagy | 2021-01-25 | 2 | -11/+11
  The hwcap value is now in Linux 5.10 and in glibc bits/hwcap.h, so use that definition. Move the definition to init-arch.h so all ifunc selectors can use it, and expose an "mte" shorthand for an MTE enabled runtime.

  For now we allow user code to enable tag checks and use PROT_MTE mappings without libc involvement. This is not guaranteed ABI, but can be useful for testing and debugging with MTE.

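  A hedged sketch of the same check done outside ld.so via the auxiliary vector (the glibc-internal MTE_ENABLED lives in init-arch.h per the commit and reads the loader's cached hwcaps); the HWCAP2_MTE bit value below is an assumption taken from the Linux arm64 uapi headers:

    #include <sys/auxv.h>

    #ifndef HWCAP2_MTE
    # define HWCAP2_MTE (1UL << 18)   /* assumption: Linux arm64 uapi value */
    #endif

    static inline int
    mte_supported (void)
    {
      return (getauxval (AT_HWCAP2) & HWCAP2_MTE) != 0;
    }
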
* aarch64: revert memcpy optimize for kunpeng to avoid performance degradation | Shuo Wang | 2021-01-21 | 1 | -1/+1
  In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 was added to the default memcpy version; however, there is performance degradation when the copy size is large, e.g. 100k. This is the result, tested in glibc-2.28:

                 before backport   after backport   Performance improvement
    memcpy_1k    0.005             0.005              0.00%
    memcpy_10k   0.032             0.029             10.34%
    memcpy_100k  0.356             0.429            -17.02%
    memcpy_1m    7.470            11.153            -33.02%

  This is the demo:

    #include "stdio.h"
    #include "string.h"
    #include "stdlib.h"

    char a[1024*1024] = {12};
    char b[1024*1024] = {13};

    int main(int argc, char *argv[])
    {
        int i = atoi(argv[1]);
        int j;
        int size = atoi(argv[2]);

        for (j = 0; j < i; j++)
            memcpy(b, a, size*1024);
        return 0;
    }

    # gcc -g -O0 memcpy.c -o memcpy
    # time taskset -c 10 ./memcpy 100000 1024

  Co-authored-by: liqingqing <liqingqing3@huawei.com>

* configure: Check for static PIE support | Szabolcs Nagy | 2021-01-21 | 2 | -0/+7
  Add SUPPORT_STATIC_PIE that targets can define if they support static PIE. This requires PI_STATIC_AND_HIDDEN support and various linker features as described in commit 9d7a3741c9e59eba87fb3ca6b9f979befce07826 ("Add --enable-static-pie configure option to build static PIE [BZ #19574]").

  Currently defined on x86_64, i386 and aarch64 where static PIE is known to work.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

* aarch64: define PI_STATIC_AND_HIDDEN | Szabolcs Nagy | 2021-01-08 | 2 | -0/+9
  AArch64 always uses pc relative access to static and hidden object symbols, but the config setting was previously missing. This affects ld.so start up code.

* Remove dbl-64/wordsize-64 (part 2) | Wilco Dijkstra | 2021-01-07 | 1 | -1/+0
  Remove the wordsize-64 implementations by merging them into the main dbl-64 directory. The second patch just moves all wordsize-64 files and removes a few wordsize-64 uses in comments and Implies files.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>