path: root/sysdeps/aarch64
Commit message | Author | Age | Files | Lines
* glibc.malloc.check: Wean away from malloc hooks | Siddhesh Poyarekar | 2021-07-22 | 1 | -0/+3
  The malloc-check debugging feature is tightly integrated into glibc malloc, so thanks to an idea from Florian Weimer, much of the malloc implementation has been moved into libc_malloc_debug.so to support malloc-check. Due to this, glibc malloc and malloc-check can no longer work together; they use separate (but identical) structures for heap management. This should not make a difference though since the malloc check hook is not disabled anywhere. malloc_set_state does, but it does so early enough that it shouldn't cause any problems.

  The malloc check tunable is now in the debug DSO and has no effect when the DSO is not preloaded.

  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  Tested-by: Carlos O'Donell <carlos@redhat.com>
* AArch64: Add hp-timing.h | Wilco Dijkstra | 2021-07-01 | 1 | -0/+47
  Add hp-timing.h using the cntvct_el0 counter. Return timing in nanoseconds so it is fully compatible with generic hp-timing. Don't set HP_TIMING_INLINE in the dynamic linker since it adds unnecessary overheads and some ancient kernels may not handle emulating cntvct correctly. Currently cntvct_el0 is only used for timing in the benchtests.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Improve strnlen performance | Wilco Dijkstra | 2021-07-01 | 1 | -181/+89
  Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1 large strings are 1.8x faster than the current version, and bench-strnlen is 50% faster overall. This version is MTE compatible.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* Update math: redirect roundeven function | H.J. Lu | 2021-06-27 | 2 | -1/+2
  Redirect target specific roundeven functions for aarch64, ldbl-128ibm and riscv.
* AArch64: Add support for roundeven[f] | Wilco Dijkstra | 2021-06-08 | 2 | -0/+57
  Add inline assembler for the roundeven functions. Passes GLIBC regression. Note GCC does not inline the builtin (PR100966), so this cannot be used for now.
* aarch64: Added optimized memset for A64FX | Naohiro Tamura | 2021-05-27 | 4 | -5/+286
  This patch optimizes the performance of memset for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node.

  The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill and prefetch.

  SVE assembler code for memset is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports the ARMv8-A SVE standard.

  We confirmed that all test cases passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2.

  We also confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'.

  [1] https://github.com/fujitsu/A64FX

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
  Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
* aarch64: Added optimized memcpy and memmove for A64FX | Naohiro Tamura | 2021-05-27 | 6 | -13/+443
  This patch optimizes the performance of memcpy/memmove for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node.

  The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining.

  SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports the ARMv8-A SVE standard.

  We confirmed that all test cases passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2.

  We also confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'.

  [1] https://github.com/fujitsu/A64FX

  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
  Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
* aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI | Naohiro Tamura | 2021-05-26 | 1 | -2/+7
  This patch defines the BTI_C and BTI_J macros conditionally for performance. If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as the HINT instructions for ARMv8.5 BTI (Branch Target Identification). If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as NOP.
* config: Added HAVE_AARCH64_SVE_ASM for aarch64 | Naohiro Tamura | 2021-05-26 | 2 | -0/+43
  This patch checks whether the assembler supports '-march=armv8.2-a+sve' for generating SVE code and, if so, defines the HAVE_AARCH64_SVE_ASM macro.
* elf: Remove lazy tlsdesc relocation related code | Szabolcs Nagy | 2021-04-21 | 1 | -1/+0
  Remove generic tlsdesc code related to lazy tlsdesc processing since lazy tlsdesc relocation is no longer supported. This includes removing GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at load time when that lock is already held.

  Added a documentation comment too.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: update libm test ulps | Szabolcs Nagy | 2021-04-08 | 1 | -1/+1
  Update after commit 43576de04afc6a0896a3ecc094e1581069a0652a.
* aarch64: free tlsdesc data on dlclose [BZ #27403] | Szabolcs Nagy | 2021-04-06 | 1 | -0/+27
  DL_UNMAP_IS_SPECIAL and DL_UNMAP were not defined. The definitions are now copied from arm, since the same is needed on aarch64. The cleanup of tlsdesc data is handled by the custom _dl_unmap.

  Fixes bug 27403.
* Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472] | Paul Zimmermann | 2021-04-02 | 1 | -35/+35
  For j0f/j1f/y0f/y1f, the largest error for all binary32 inputs is reduced to at most 9 ulps for all rounding modes.

  The new code is enabled only when there is a cancellation at the very end of the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not give any visible slowdown on average. Two different algorithms are used:

  * around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)

  * for large inputs, an asymptotic formula from [1] is used

  [1] Fast and Accurate Bessel Function Computation, John Harrison, Proceedings of Arith 19, 2009.

  Inputs yielding the new largest errors are added to auto-libm-test-in, and ulps are regenerated for various targets (thanks Adhemerval Zanella).

  Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: Optimize __libc_mtag_tag_zero_region | Szabolcs Nagy | 2021-03-26 | 1 | -16/+80
  This is a target hook for memory tagging, the original was a naive implementation. Uses the same algorithm as __libc_mtag_tag_region, but with instructions that also zero the memory. This was not benchmarked on a real CPU, but it is expected to be faster than the naive implementation.
* aarch64: Optimize __libc_mtag_tag_region | Szabolcs Nagy | 2021-03-26 | 1 | -18/+80
  This is a target hook for memory tagging, the original was a naive implementation. The optimized version relies on "dc gva" to tag 64 bytes at a time for large allocations and optimizes small cases without adding too many branches. This was not benchmarked on a real CPU, but it is expected to be faster than the naive implementation.
* aarch64: inline __libc_mtag_new_tag | Szabolcs Nagy | 2021-03-26 | 3 | -41/+11
  This is a common operation when heap tagging is enabled, so inline the instructions instead of using an extern call.
* aarch64: inline __libc_mtag_address_get_tag | Szabolcs Nagy | 2021-03-26 | 3 | -39/+10
  This is a common operation when heap tagging is enabled, so inline the instruction instead of using an extern call.

  The .inst directive is used instead of the name of the instruction (or acle intrinsics) because malloc.c is not compiled for armv8.5-a+memtag architecture, runtime cpu support detection is used.

  Prototypes are removed from the comments as they were not always correct.
* malloc: Only support zeroing and not arbitrary memset with mtag | Szabolcs Nagy | 2021-03-26 | 3 | -14/+10
  The memset api is suboptimal and does not provide much benefit. Memory tagging only needs a zeroing memset (and only for memory that's sized and aligned to multiples of the tag granule), so change the internal api and the target hooks accordingly. This is to simplify the implementation of the target hook.

  Reviewed-by: DJ Delorie <dj@redhat.com>
* math: Remove slow paths from asin and acos [BZ #15267] | Wilco Dijkstra | 2021-03-11 | 1 | -1/+1
  This patch series removes all remaining slow paths and related code. First asin/acos, tan, atan, atan2 implementations are updated, and the final patch removes the unused mpa files, headers and probes. Passes buildmanyglibc.

  Remove slow paths from asin/acos. Add ULP annotations based on previous slow path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps.

  Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
* aarch64: update ulps. | Szabolcs Nagy | 2021-03-01 | 1 | -15/+17
  For new test cases in commit 5a051454a9b50c27984bbc499ee1297de48e2dc8
* Reduce the statically linked startup code [BZ #23323] | Florian Weimer | 2021-02-25 | 1 | -12/+2
  It turns out the startup code in csu/elf-init.c has a perfect pair of ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A New Method to Bypass 64-bit Linux ASLR"). These functions are not needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY are already processed by the dynamic linker. However, the dynamic linker skipped the main program for some reason. For maximum backwards compatibility, this is not changed, and instead, the main map is consulted from __libc_start_main if the init function argument is a NULL pointer.

  For statically linked binaries, the old approach based on linker symbols is still used because there is nothing else available.

  A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because new binaries running on an old libc would not run their ELF constructors, leading to difficult-to-debug issues.
* aarch64: Fix the list of tested IFUNC variants [BZ #26818] | Szabolcs Nagy | 2021-01-25 | 2 | -4/+6
  Some IFUNC variants are not compatible with BTI and MTE so don't set them as usable for testing and benchmarking on a BTI or MTE enabled system.

  As far as IFUNC selectors are concerned a system is BTI enabled if the cpu supports it and glibc was built with BTI branch protection.

  Most IFUNC variants are BTI compatible, but thunderx2 memcpy and memmove use a jump table with indirect jump, without a BTI j.

  Fixes bug 26818.
* aarch64: Move and update the definition of MTE_ENABLED | Szabolcs Nagy | 2021-01-25 | 2 | -11/+11
  The hwcap value is now in linux 5.10 and in glibc bits/hwcap.h, so use that definition.

  Move the definition to init-arch.h so all ifunc selectors can use it and expose an "mte" shorthand for mte enabled runtime.

  For now we allow user code to enable tag checks and use PROT_MTE mappings without libc involvement, this is not guaranteed ABI, but can be useful for testing and debugging with MTE.
* aarch64: revert memcpy optimze for kunpeng to avoid performance degradation | Shuo Wang | 2021-01-21 | 1 | -1/+1
  In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version, however, there is performance degradation when the copy size is some large bytes, eg: 100k. This is the result, tested in glibc-2.28:

                 before backport   after backport   Performance improvement
    memcpy_1k    0.005             0.005              0.00%
    memcpy_10k   0.032             0.029             10.34%
    memcpy_100k  0.356             0.429            -17.02%
    memcpy_1m    7.470             11.153           -33.02%

  This is the demo

    #include "stdio.h"
    #include "string.h"
    #include "stdlib.h"

    char a[1024*1024] = {12};
    char b[1024*1024] = {13};

    int main(int argc, char *argv[])
    {
        int i = atoi(argv[1]);
        int j;
        int size = atoi(argv[2]);

        for (j = 0; j < i; j++)
            memcpy(b, a, size*1024);
        return 0;
    }

    # gcc -g -O0 memcpy.c -o memcpy
    # time taskset -c 10 ./memcpy 100000 1024

  Co-authored-by: liqingqing <liqingqing3@huawei.com>
* configure: Check for static PIE support | Szabolcs Nagy | 2021-01-21 | 2 | -0/+7
  Add SUPPORT_STATIC_PIE that targets can define if they support static PIE. This requires PI_STATIC_AND_HIDDEN support and various linker features as described in

    commit 9d7a3741c9e59eba87fb3ca6b9f979befce07826
    Add --enable-static-pie configure option to build static PIE [BZ #19574]

  Currently defined on x86_64, i386 and aarch64 where static PIE is known to work.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: define PI_STATIC_AND_HIDDEN | Szabolcs Nagy | 2021-01-08 | 2 | -0/+9
  AArch64 always uses pc relative access to static and hidden object symbols, but the config setting was previously missing.

  This affects ld.so start up code.
* Remove dbl-64/wordsize-64 (part 2) | Wilco Dijkstra | 2021-01-07 | 1 | -1/+0
  Remove the wordsize-64 implementations by merging them into the main dbl-64 directory. The second patch just moves all wordsize-64 files and removes a few wordsize-64 uses in comments and Implies files.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: push the set of rules before falling into slow path | Shuo Wang | 2021-01-05 | 1 | -0/+2
  The rules for the instructions are supposed to be saved before falling into the slow path.

  Tested in glibc-2.28 before fixing:

    Thread 2 "xxxxxxx" hit Breakpoint 1, _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
    149 stp x1, x2, [sp, #-32]!
    Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.3.0-20190804.h24.aarch64
    (gdb) ni
    _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
    150 stp x3, x4, [sp, #16]
    (gdb)
    _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
    157 mrs x4, tpidr_el0
    (gdb)
    158 ldr PTR_REG (1), [x0,#TLSDESC_ARG]
    (gdb)
    159 ldr PTR_REG (0), [x4,#TCBHEAD_DTV]
    (gdb)
    160 ldr PTR_REG (3), [x1,#TLSDESC_GEN_COUNT]
    (gdb)
    161 ldr PTR_REG (2), [x0,#DTV_COUNTER]
    (gdb)
    162 cmp PTR_REG (3), PTR_REG (2)
    (gdb)
    163 b.hi 2f
    (gdb)
    165 ldp PTR_REG (2), PTR_REG (3), [x1,#TLSDESC_MODID]
    (gdb)
    166 add PTR_REG (0), PTR_REG (0), PTR_REG (2), lsl #(PTR_LOG_SIZE + 1)
    (gdb)
    167 ldr PTR_REG (0), [x0] /* Load val member of DTV entry. */
    (gdb)
    168 cmp PTR_REG (0), #TLS_DTV_UNALLOCATED
    (gdb)
    169 b.eq 2f
    (gdb) bt
    #0 _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:169
    #1 0x0000ffffbe4fbb44 in OurFunction (threadId=4294967295) at /home/test/test_function.c:30
    #2 0x0000000000400c08 in initaaa () at thread.c:58
    #3 0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
    #4 0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
    #5 0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
    (gdb) ni
    _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:184
    184 stp x29, x30, [sp,#-16*NSAVEXREGPAIRS]!
    (gdb) bt
    #0 _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:184
    #1 0x0000ffffbe4fbb44 in OurFunction (threadId=4294967295) at /home/test/test_function.c:30
    #2 0x0000000000000000 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

  Co-authored-by: liqingqing <liqingqing3@huawei.com>
* aarch64: fix stack missing after sp is updated | Shuo Wang | 2021-01-04 | 1 | -1/+1
  After sp is updated, the CFA offset should be set before the next instruction.

  Tested in glibc-2.28:

    Thread 2 "xxxxxxx" hit Breakpoint 1, _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
    149 stp x1, x2, [sp, #-32]!
    Missing separate debuginfos, use: dnf debuginfo-install libgcc-7.3.0-20190804.h24.aarch64
    (gdb) bt
    #0 _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:149
    #1 0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184) at /home/test/test_function.c:30
    #2 0x0000000000400c08 in initaaa () at thread.c:58
    #3 0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
    #4 0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
    #5 0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
    (gdb) ni
    _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
    150 stp x3, x4, [sp, #16]
    (gdb) bt
    #0 _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:150
    #1 0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184) at /home/test/test_function.c:30
    #2 0x0000000000000000 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
    (gdb) ni
    _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
    157 mrs x4, tpidr_el0
    (gdb) bt
    #0 _dl_tlsdesc_dynamic () at ../sysdeps/aarch64/dl-tlsdesc.S:157
    #1 0x0000ffffbe4fbb44 in OurFunction (threadId=3194870184) at /home/test/test_function.c:30
    #2 0x0000000000400c08 in initaaa () at thread.c:58
    #3 0x0000000000400c50 in thread_proc (param=0x0) at thread.c:71
    #4 0x0000ffffbf6918bc in start_thread (arg=0xfffffffff29f) at pthread_create.c:486
    #5 0x0000ffffbf5669ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

  Signed-off-by: liqingqing <liqingqing3@huawei.com>
  Signed-off-by: Shuo Wang <wangshuo47@huawei.com>
* Update copyright dates with scripts/update-copyrights | Paul Eggert | 2021-01-02 | 128 | -128/+128
  I used these shell commands:

    ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
    (cd ../glibc && git commit -am"[this commit message]")

  and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO.

  I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah:

    remote: *** pre-commit check failed ...
    remote: *** error: lines with trailing whitespace found
    remote: error: hook declined to update refs/heads/master
* aarch64: use PTR_ARG and SIZE_ARG instead of DELOUSE | Szabolcs Nagy | 2020-12-31 | 24 | -65/+65
  DELOUSE was added to asm code to make them compatible with non-LP64 ABIs, but it is an unfortunate name and the code was not compatible with ABIs where pointer and size_t are different. Glibc currently only supports the LP64 ABI so these macros are not really needed or tested, but for now the name is changed to be more meaningful instead of removing them completely.

  Some DELOUSE macros were dropped: clone, strlen and strnlen used it unnecessarily.

  The out of tree ILP32 patches are currently not maintained and will likely need a rework to rebase them on top of the time64 changes.
* aarch64: update ulps. | Szabolcs Nagy | 2020-12-21 | 1 | -10/+12
  For new test cases in commit cad5ad81d2f7f58a7ad0d8afa8c1b7101a0301fb
* aarch64: Add aarch64-specific files for memory tagging support | Richard Earnshaw | 2020-12-21 | 6 | -0/+235
  This final patch provides the architecture-specific implementation of the memory-tagging support hooks for aarch64.
* aarch64: remove the strlen_asimd symbol | Szabolcs Nagy | 2020-12-15 | 1 | -2/+1
  This symbol is not in the implementation reserved namespace for static linking and it was never used: it seems it was mistakenly added in the original strlen_asimd commit 436e4d5b965abe592d26150cb518accf9ded8fe4
* aarch64: fix static PIE start code for BTI [BZ #27068] | Guillaume Gardet | 2020-12-15 | 1 | -0/+1
  A bti c was missing from rcrt1.o which made all -static-pie binaries fail at program startup on BTI enabled systems.

  Fixes bug 27068.
* aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831] | Szabolcs Nagy | 2020-12-11 | 3 | -19/+43
  Re-mmap executable segments if possible instead of using mprotect to add PROT_BTI. This allows using BTI protection with security policies that prevent mprotect with PROT_EXEC.

  If the fd of the ELF module is not available because it was kernel mapped then mprotect is used and failures are ignored. To protect the main executable even when mprotect is filtered the linux kernel will have to be changed to add PROT_BTI to it.

  The delayed failure reporting is mainly needed because currently _dl_process_gnu_properties does not propagate failures such that the required cleanups happen. Using the link_map_machine struct for error propagation is not ideal, but this seemed to be the least intrusive solution.

  Fixes bug 26831.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* elf: Pass the fd to note processing | Szabolcs Nagy | 2020-12-11 | 1 | -3/+3
  To handle GNU property notes on aarch64 some segments need to be mmaped again, so the fd of the loaded ELF module is needed.

  When the fd is not available (kernel loaded modules), then -1 is passed.

  The fd is passed to both _dl_process_pt_gnu_property and _dl_process_pt_note for consistency. Target specific note processing functions are updated accordingly.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: align address for BTI protection [BZ #26988] | Szabolcs Nagy | 2020-12-11 | 1 | -6/+8
  Handle unaligned executable load segments (the bfd linker is not expected to produce such binaries, but other linkers may).

  Computing the mapping bounds follows _dl_map_object_from_fd more closely now.

  Fixes bug 26988.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: Fix missing BTI protection from dependencies [BZ #26926] | Szabolcs Nagy | 2020-12-11 | 1 | -2/+15
  The _dl_open_check and _rtld_main_check hooks are not called on the dependencies of a loaded module, so BTI protection was missed on every module other than the main executable and directly dlopened libraries.

  The fix just iterates over dependencies to enable BTI.

  Fixes bug 26926.
* nptl: Move stack list variables into _rtld_global | Florian Weimer | 2020-11-16 | 1 | -2/+0
  Now __thread_gscope_wait (the function behind THREAD_GSCOPE_WAIT, formerly __wait_lookup_done) can be implemented directly in ld.so, eliminating the unprotected GL (dl_wait_lookup_done) function pointer.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: Add unwind information to _start (bug 26853) | Florian Weimer | 2020-11-09 | 1 | -4/+3
  This adds CFI directives which communicate that the stack ends with this function.

  Fixes bug 26853.
* aarch64: Add variant PCS lazy binding test [BZ #26798] | Szabolcs Nagy | 2020-11-02 | 5 | -0/+288
  This test fails without bug 26798 fixed because some integer registers likely get clobbered by lazy binding and variant PCS only allows x16 and x17 to be clobbered at call time.

  The test requires binutils 2.32.1 or newer for handling variant PCS symbols. SVE registers are not covered by this test, to avoid the complexity of handling multiple compile- and runtime feature support cases.
* aarch64: Fix DT_AARCH64_VARIANT_PCS handling [BZ #26798] | Szabolcs Nagy | 2020-11-02 | 1 | -8/+4
  The variant PCS support was ineffective because in the common case linkmap->l_mach.plt == 0 but then the symbol table flags were ignored and normal lazy binding was used instead of resolving the relocs early. (This was a misunderstanding about how GOT[1] is set up by the linker.)

  In practice this mainly affects SVE calls when the vector length is more than 128 bits, then the top bits of the argument registers get clobbered during lazy binding.

  Fixes bug 26798.
* AArch64: Use __memcpy_simd on Neoverse N2/V1 | Wilco Dijkstra | 2020-10-14 | 2 | -2/+4
  Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as the memcpy/memmove ifunc.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: enforce >=64K guard size [BZ #26691] | Szabolcs Nagy | 2020-10-02 | 1 | -0/+3
  There are several compiler implementations that allow large stack allocations to jump over the guard page at the end of the stack and corrupt memory beyond that. See CVE-2017-1000364.

  Compilers can emit code to probe the stack such that the guard page cannot be skipped, but on aarch64 the probe interval is 64K by default instead of the minimum supported page size (4K).

  This patch enforces at least 64K guard on aarch64 unless the guard is disabled by setting its size to 0. For backward compatibility reasons the increased guard is not reported, so it is only observable by exhausting the address space or parsing /proc/self/maps on linux.

  On other targets the patch has no effect. If the stack probe interval is larger than a page size on a target then ARCH_MIN_GUARD_SIZE can be defined to get large enough stack guard on libc allocated stacks.

  The patch does not affect threads with user allocated stacks.

  Fixes bug 26691.
* AArch64: Improve backwards memmove performance | Wilco Dijkstra | 2020-08-28 | 1 | -3/+4
  On some microarchitectures performance of the backwards memmove improves if the stores use STR with decreasing addresses. So change the memmove loop in memcpy_advsimd.S to use 2x STR rather than STP.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: update ulps. | Szabolcs Nagy | 2020-08-13 | 1 | -1/+1
  For new j0 test.
* aarch64: Use future HWCAP2_MTE in ifunc resolver | Szabolcs Nagy | 2020-07-27 | 1 | -2/+8
  Make glibc MTE-safe on systems where MTE is available. This allows using heap tagging with an LD_PRELOADed malloc implementation that enables MTE. We don't document this as guaranteed contract yet, so glibc may not be MTE safe when HWCAP2_MTE is set (older glibcs certainly aren't). This is mainly for testing and debugging.

  The HWCAP flag is not exposed in public headers until Linux adds it to its uapi. The HWCAP value reservation will be in Linux 5.9.
* aarch64: Respect p_flags when protecting code with PROT_BTI | Szabolcs Nagy | 2020-07-24 | 1 | -1/+8
  Use PROT_READ and PROT_WRITE according to the load segment p_flags when adding PROT_BTI.

  This is before processing relocations which may drop PROT_BTI in case of textrels. Executable stacks are not protected via PROT_BTI either. PROT_BTI is hardening in case memory corruption happened, its value is reduced if there is writable and executable memory available so missing it on such memory is fine, but we should respect the p_flags and should not drop PROT_WRITE.
* AArch64: Improve strlen_asimd performance (bug 25824) | Wilco Dijkstra | 2020-07-17 | 5 | -126/+161
  Optimize strlen using a mix of scalar and SIMD code. On modern micro architectures large strings are 2.6 times faster than existing strlen_asimd and 35% faster than the new MTE version of strlen.

  On a random strlen benchmark using small sizes the speedup is 7% vs strlen_asimd and 40% vs the MTE strlen. This fixes the main strlen regressions on Cortex-A53 and other cores with a simple Neon unit.

  Rename __strlen_generic to __strlen_mte, and select strlen_asimd when MTE is not enabled (this is waiting on support for a HWCAP_MTE bit).

  This fixes big-endian bug 25824. Passes GLIBC regression tests.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>