path: root/sysdeps/aarch64
Commit message (Author, Date; Files, Lines)
* aarch64: Add vector implementations of exp10 routines (Joe Ramsay, 2023-10-23; 12 files, -0/+524)
  Double-precision routines either reuse the exp table (AdvSIMD) or use the SVE FEXPA instruction.
* aarch64: Add vector implementations of log10 routines (Joe Ramsay, 2023-10-23; 14 files, -1/+580)
  A table is also added, which is shared between AdvSIMD and SVE log10.
* aarch64: Add vector implementations of log2 routines (Joe Ramsay, 2023-10-23; 14 files, -1/+545)
  A table is also added, which is shared between AdvSIMD and SVE log2.
* aarch64: Add vector implementations of exp2 routines (Joe Ramsay, 2023-10-23; 12 files, -0/+459)
  Some routines reuse the table from v_exp_data.c.
* aarch64: Add vector implementations of tan routines (Joe Ramsay, 2023-10-23; 18 files, -1/+1244)
  This includes some utility headers for evaluating polynomials using various schemes.
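  Note: the schemes referred to are typically Horner and Estrin evaluation, which trade
  operation count against instruction-level parallelism. A minimal scalar C sketch of the
  difference for a degree-3 polynomial (illustrative only; the actual headers operate on
  AdvSIMD/SVE vectors and cover many degrees):

      #include <math.h>   /* fma */

      /* Horner: fewest operations, but each fma depends on the previous one.  */
      static double
      horner_d3 (double x, const double c[4])
      {
        return fma (fma (fma (c[3], x, c[2]), x, c[1]), x, c[0]);
      }

      /* Estrin: pairs terms so the two inner fmas are independent and can
         issue in parallel, at the cost of computing x^2 up front.  */
      static double
      estrin_d3 (double x, const double c[4])
      {
        double x2 = x * x;
        double p01 = fma (c[1], x, c[0]);   /* c0 + c1*x     */
        double p23 = fma (c[3], x, c[2]);   /* c2 + c3*x     */
        return fma (p23, x2, p01);          /* p01 + p23*x^2 */
      }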
* aarch64: Optimise vecmath logs (Joe Ramsay, 2023-10-05; 7 files, -215/+226)
  * Transpose table layout for improved memory access
  * Use half-vector special comparisons for AdvSIMD
  * Improve register use near special-case branches
    - Due to the presence of a function call, the return value would get mov-d out of x0 in
      order to facilitate the PCS. By moving the final computation after the branch this can
      be avoided.
  Also change SVE routines to use overloaded intrinsics for readability.
* aarch64: Cosmetic change in SVE exp routines (Joe Ramsay, 2023-10-05; 2 files, -47/+44)
  Use overloaded intrinsics for readability. Codegen does not change; however, while we're
  bringing the routines up to date with recent improvements to other routines in AOR, it is
  worth copying this change over as well.
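  Note: the SVE ACLE declares both type-suffixed and overloaded spellings for most
  intrinsics; the change is purely from the former to the latter. A small sketch (not the
  actual routine code; build with SVE enabled, e.g. -march=armv8-a+sve):

      #include <arm_sve.h>

      /* Same fused multiply-add (x + y*z), written with the explicit and the
         overloaded intrinsic names.  */
      static svfloat64_t
      mla_explicit (svbool_t pg, svfloat64_t x, svfloat64_t y, svfloat64_t z)
      {
        return svmla_f64_x (pg, x, y, z);   /* explicit, type-suffixed form */
      }

      static svfloat64_t
      mla_overloaded (svbool_t pg, svfloat64_t x, svfloat64_t y, svfloat64_t z)
      {
        return svmla_x (pg, x, y, z);       /* same operation, overloaded form */
      }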
* aarch64: Optimize SVE cos & cosf (Joe Ramsay, 2023-10-05; 2 files, -53/+47)
  Saves a mov by ensuring the return value does not need to be moved out of the way before
  the special-case branch. Also change to use overloaded intrinsics.
* aarch64: Improve vecmath sin routines (Joe Ramsay, 2023-10-05; 3 files, -73/+87)
  * Update ULP comment reflecting a new observed max in [-pi/2, pi/2]
  * Use the same polynomial in AdvSIMD and SVE, rather than FTRIG instructions
  * Improve register use near special-case branch
  Also use overloaded intrinsics for SVE.
* AArch64: Remove -0.0 check from vector sin (Wilco Dijkstra, 2023-09-26; 2 files, -12/+2)
  Remove the unnecessary extra checks for sin (-0.0) from vector sin/sinf, improving
  performance. Passes regress.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* configure: Use autoconf 2.71 (Siddhesh Poyarekar, 2023-07-17; 1 file, -91/+111)
  Bump autoconf requirement to 2.71 to allow regenerating configure on more recent
  distributions. autoconf 2.71 has been in Fedora since F36 and is the current version in
  Debian stable (bookworm). It appears to be current in Gentoo as well.
  All sysdeps configure and preconfigure scripts have also been regenerated; all changes are
  trivial transformations that do not affect functionality.
  Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* aarch64: Add vector implementations of exp routines (Joe Ramsay, 2023-06-30; 14 files, -1/+593)
  Optimised implementations for single and double precision, Advanced SIMD and SVE, copied
  from Arm Optimized Routines.
  As previously, data tables are used via a barrier to prevent overly aggressive constant
  inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of
  switching call standards unnecessarily.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* aarch64: Add vector implementations of log routines (Joe Ramsay, 2023-06-30; 14 files, -1/+559)
  Optimised implementations for single and double precision, Advanced SIMD and SVE, copied
  from Arm Optimized Routines. Log lookup table added as HIDDEN symbol to allow it to be
  shared between AdvSIMD and SVE variants.
  As previously, data tables are used via a barrier to prevent overly aggressive constant
  inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of
  switching call standards unnecessarily.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* aarch64: Add vector implementations of sin routines (Joe Ramsay, 2023-06-30; 12 files, -6/+426)
  Optimised implementations for single and double precision, Advanced SIMD and SVE, copied
  from Arm Optimized Routines.
  As previously, data tables are used via a barrier to prevent overly aggressive constant
  inlining. Special-case handlers are marked NOINLINE to avoid incurring the penalty of
  switching call standards unnecessarily.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* aarch64: Add vector implementations of cos routines (Joe Ramsay, 2023-06-30; 10 files, -117/+608)
  Replace the loop-over-scalar placeholder routines with optimised implementations from Arm
  Optimized Routines (AOR). Also add some headers containing utilities for aarch64 libmvec
  routines, and update libm-test-ulps.
  Data tables for new routines are used via a pointer with a barrier on it, in order to
  prevent overly aggressive constant inlining in GCC. This allows a single adrp, combined
  with offset loads, to be used for every constant in the table.
  Special-case handlers are marked NOINLINE in order to confine the save/restore overhead of
  switching from vector to normal calling standard. This way we only incur the extra memory
  access in the exceptional cases. NOINLINE definitions have been moved to math_private.h in
  order to reduce duplication.
  AOR exposes a config option, WANT_SIMD_EXCEPT, to enable selective masking (and later
  fixing up) of invalid lanes, in order to trigger fp exceptions correctly (AdvSIMD only).
  This is tested and maintained in AOR; however, it is configured off at source level here
  for performance reasons. We keep the WANT_SIMD_EXCEPT blocks in routine sources to greatly
  simplify the upstreaming process from AOR to glibc.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
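  Note: the pointer-barrier and NOINLINE pattern described above can be sketched roughly as
  follows (names, values and the maths are made up; the real glibc/AOR sources differ):

      /* Table of constants; hiding its address from the optimiser makes GCC
         load everything via one adrp plus offset loads instead of inlining
         each constant separately.  */
      struct v_data { double c0, c1; };
      static const struct v_data v_example_data = { 1.0, 0.5 };

      static inline const struct v_data *
      ptr_barrier (const struct v_data *p)
      {
        __asm__ ("" : "+r" (p));   /* opaque to the compiler */
        return p;
      }

      /* NOINLINE confines the save/restore cost of switching from the vector
         to the normal calling standard to the exceptional path.  */
      __attribute__ ((noinline)) static double
      special_case (double x)
      {
        return x;   /* placeholder scalar fallback */
      }

      static double
      example_core (double x)
      {
        const struct v_data *d = ptr_barrier (&v_example_data);
        if (__builtin_expect (x > 0x1.62e42fefa39efp9, 0))   /* made-up bound */
          return special_case (x);
        return d->c0 + x * d->c1;                            /* made-up maths */
      }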
* Fix a few more typos I missed in previous round -- BZ 25337 (Paul Pluzhnikov, 2023-06-02; 1 file, -1/+1)
* Fix misspellings in sysdeps/ -- BZ 25337 (Paul Pluzhnikov, 2023-05-30; 2 files, -5/+5)
* aarch64: More configure checks for libmvec (Szabolcs Nagy, 2023-05-05; 2 files, -6/+48)
  Check assembler and linker support too, not just SVE ACLE in the compiler, since variant
  PCS requires at least binutils 2.32.1.
* aarch64: SVE ACLE configure test cleanups (Szabolcs Nagy, 2023-05-05; 2 files, -16/+27)
  Use a more idiomatic configure test for better autoconf cache and logs.
* aarch64: fix SVE ACLE check for bootstrap glibc builds (Szabolcs Nagy, 2023-05-04; 2 files, -2/+2)
  arm_sve.h depends on stdint.h, but that relies on libc headers unless compiled in
  freestanding mode. Without this change a bootstrap glibc build (that uses a compiler
  without installed libc headers) failed with:

    checking for availability of SVE ACLE...
    In file included from [...]/arm_sve.h:28,
                     from conftest.c:1:
    [...]/stdint.h:9:16: fatal error: stdint.h: No such file or directory
        9 | # include_next <stdint.h>
          |                ^~~~~~~~~~
    compilation terminated.
    configure: error: mathvec is enabled but compiler does not have SVE ACLE. [...]
* Enable libmvec support for AArch64 (Joe Ramsay, 2023-05-03; 25 files, -0/+910)
  This patch enables libmvec on AArch64. The proposed change is mainly implementing build
  infrastructure to add the new routines to ABI, tests and benchmarks.
  I have demonstrated how this all fits together by adding implementations for vector cos,
  in both single and double precision, targeting both Advanced SIMD and SVE. The
  implementations of the routines themselves are just loops over the scalar routine from
  libm for now, as we are more concerned with getting the plumbing right at this point. We
  plan to contribute vector routines from the Arm Optimized Routines repo that are compliant
  with requirements described in the libmvec wiki.
  Building libmvec requires GCC 10 or later for the SVE ACLE. To avoid raising the minimum
  GCC by such a big jump, we allow users to disable libmvec if their compiler is too old.
  Note that at this point users have to manually call the vector math functions. This seems
  to be acceptable to some downstream users.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
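  Note: "manually call" means using the vector-ABI symbol names directly, since no public
  header declares them at this point. A sketch for the AdvSIMD double-precision cos added by
  this patch (the _ZGVnN2v_ prefix follows the AArch64 vector function ABI; link with
  -lmvec -lm):

      #include <arm_neon.h>
      #include <stdio.h>

      /* AdvSIMD ('n'), unmasked ('N'), 2 lanes, one vector argument ('v').  */
      extern float64x2_t _ZGVnN2v_cos (float64x2_t);

      int
      main (void)
      {
        float64x2_t x = vsetq_lane_f64 (1.5707963267948966, vdupq_n_f64 (0.0), 1);
        float64x2_t y = _ZGVnN2v_cos (x);
        printf ("cos: %f %f\n", vgetq_lane_f64 (y, 0), vgetq_lane_f64 (y, 1));
        return 0;
      }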
* aarch64: update libm test ulps (Szabolcs Nagy, 2023-02-24; 1 file, -0/+1)
* AArch64: Fix HP_TIMING_DIFF computation [BZ# 29329] (Jun Tang, 2023-02-22; 1 file, -1/+1)
  Fix the computation to allow for cntfrq_el0 being larger than 1GHz. Assume cntfrq_el0 is a
  multiple of 1MHz to increase the maximum interval (1024 seconds at 1GHz).
  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
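  Note: the failure mode is purely arithmetic: a conversion built around the integer factor
  1000000000 / freq truncates to zero once cntfrq_el0 exceeds 1 GHz, while dividing the
  frequency down to MHz first (exact under the stated assumption) keeps the factor usable.
  A minimal sketch of that idea only, not the actual HP_TIMING_DIFF macro:

      #include <stdint.h>

      /* Convert generic-timer ticks to nanoseconds, assuming freq (from
         cntfrq_el0) is an exact multiple of 1 MHz.  */
      static uint64_t
      ticks_to_ns (uint64_t ticks, uint64_t freq)
      {
        uint64_t freq_mhz = freq / 1000000;
        return ticks * 1000 / freq_mhz;   /* == ticks * 1e9 / freq */
      }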
* string: Remove string_private.h (Adhemerval Zanella, 2023-02-17; 1 file, -20/+0)
  Now that _STRING_ARCH_unaligned is not used anymore.
  Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
* string: Add libc_hidden_proto for memrchr (Adhemerval Zanella, 2023-02-08; 1 file, -0/+1)
  Although the static linker can optimize it to a local call, this follows the internal
  scheme of providing hidden prototypes and definitions.
  Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
* string: Add libc_hidden_proto for strchrnul (Adhemerval Zanella, 2023-02-08; 1 file, -0/+1)
  Although the static linker can optimize it to a local call, this follows the internal
  scheme of providing hidden prototypes and definitions.
  Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
* AArch64: Improve SVE memcpy and memmove (Wilco Dijkstra, 2023-02-06; 1 file, -20/+14)
  Improve SVE memcpy by copying 2 vectors if the size is small enough. This improves
  performance of random memcpy by ~9% on Neoverse V1, and 33-64 byte copies are ~16% faster.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Improve strrchr (Wilco Dijkstra, 2023-01-17; 1 file, -25/+33)
  Use shrn for narrowing the mask, which simplifies the code and speeds up small strings.
  Unroll the first search loop to improve performance on large strings.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
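  Note: the shrn idiom used here (and in several of the entries below) can be illustrated
  with intrinsics; this is a sketch of the general technique, not the glibc assembly:

      #include <arm_neon.h>

      /* Narrow a 16-byte 0x00/0xff comparison result to a 64-bit mask with
         4 bits per input byte: shrn shifts each 16-bit lane right by 4 and
         keeps the low 8 bits, so one scalar register holds the whole mask.  */
      static inline uint64_t
      narrow_mask (uint8x16_t eq)
      {
        uint8x8_t n = vshrn_n_u16 (vreinterpretq_u16_u8 (eq), 4);
        return vget_lane_u64 (vreinterpret_u64_u8 (n), 0);
      }

      /* Index of the first matching byte; mask must be non-zero.  */
      static inline int
      first_match (uint64_t mask)
      {
        return __builtin_ctzll (mask) >> 2;
      }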
* AArch64: Optimize strnlen (Wilco Dijkstra, 2023-01-17; 1 file, -21/+18)
  Optimize strnlen using the shrn instruction and improve the main loop. Small strings are
  around 10% faster, large strings are 40% faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize strlen (Wilco Dijkstra, 2023-01-17; 1 file, -8/+12)
  Optimize strlen by unrolling the main loop. Large strings are 64% faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize strcpy (Wilco Dijkstra, 2023-01-17; 1 file, -17/+19)
  Unroll the main loop. Large strings are around 20% faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Improve strchrnul (Wilco Dijkstra, 2023-01-17; 1 file, -2/+10)
  Unroll the main loop, which improves performance slightly.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize strchr (Wilco Dijkstra, 2023-01-17; 1 file, -28/+24)
  Simplify calculation of the mask using shrn. Unroll the main loop. Small strings are 20%
  faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Improve strlen_asimd (Wilco Dijkstra, 2023-01-17; 1 file, -12/+4)
  Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance
  improves slightly as a result.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize memrchr (Wilco Dijkstra, 2023-01-17; 1 file, -9/+11)
  Optimize the main loop: large strings are 43% faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize memchr (Wilco Dijkstra, 2023-01-17; 1 file, -13/+14)
  Optimize the main loop: large strings are 40% faster on modern CPUs.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* Update copyright dates with scripts/update-copyrights (Joseph Myers, 2023-01-06; 135 files, -135/+135)
* elf: Fix rtld-audit trampoline for aarch64 (Vladislav Khmelevsky, 2022-11-21; 1 file, -3/+1)
  This patch fixes two problems with audit:
  1. The DL_OFFSET_RV_VPCS offset was mixed up with DL_OFFSET_RG_VPCS, resulting in the x2
     register value being nulled in the RG structure.
  2. We need to preserve the x8 register before the function call, but we don't have to save
     its new value and restore it before returning. In any case, the final restore was using
     the OFFSET_RV value instead of OFFSET_RG, which is wrong (although it doesn't affect
     anything).
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* elf: Introduce <dl-call_tls_init_tp.h> and call_tls_init_tp (bug 29249) (Florian Weimer, 2022-11-03; 1 file, -1/+1)
  This makes it more likely that the compiler can compute the strlen argument in
  _startup_fatal at compile time, which is required to avoid a dependency on strlen this
  early during process startup.
  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* configure: Use -Wno-ignored-attributes if compiler warns about multiple aliases (Adhemerval Zanella, 2022-11-01; 1 file, -0/+1)
  clang emits a warning when a double alias redirection is used, to warn that the original
  symbol will be used even when the weak definition is overridden. However, this is a common
  pattern for weak_alias, where multiple aliases are set to the same symbol.
  Reviewed-by: Fangrui Song <maskray@google.com>
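  Note: the pattern in question looks roughly like this (a sketch using glibc's weak_alias
  convention; the function names are made up):

      /* glibc-style weak alias: aliasname becomes a weak symbol for name.  */
      #define weak_alias(name, aliasname) \
        extern __typeof (name) aliasname __attribute__ ((weak, alias (#name)));

      int
      __example_impl (int x)
      {
        return x + 1;
      }

      /* Several public names for one internal symbol; clang's
         -Wignored-attributes warns that calls keep binding to __example_impl
         even if a weak definition is overridden, which is exactly the
         behaviour glibc relies on here.  */
      weak_alias (__example_impl, example)
      weak_alias (__example_impl, example_alt)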
* aarch64: Don't build wordcopy (Szabolcs Nagy, 2022-10-28; 1 file, -0/+0)
  Use an empty wordcopy.c to avoid building the generic one. It does not seem to be used
  anywhere.
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* aarch64: Use memcpy_simd as the default memcpy (Wilco Dijkstra, 2022-10-26; 6 files, -370/+81)
  Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy
  with it. If SVE is available, an SVE memcpy will be used by default (including for
  Neoverse N2).
* aarch64: Cleanup memset ifunc (Wilco Dijkstra, 2022-10-26; 2 files, -17/+26)
  Clean up the memset ifunc selectors. The A64FX memset relies on a ZVA size of 256, so add
  an explicit check.
* Use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources (Florian Weimer, 2022-10-18; 1 file, -2/+0)
  In the future, this will result in a compilation failure if the macros are unexpectedly
  undefined (due to header inclusion ordering or header inclusion missing altogether).
  Assembler sources are more difficult to convert. In many cases, they are hand-optimized
  for the mangling and no-mangling variants, which is why they are not converted.
  sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c are special: These
  are C sources, but most of the implementation is in assembler, so the PTR_DEMANGLE macro
  has to be undefined in some cases, to match the assembler style.
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* Introduce <pointer_guard.h>, extracted from <sysdep.h> (Florian Weimer, 2022-10-18; 3 files, -0/+3)
  This allows us to define a generic no-op version of PTR_MANGLE and PTR_DEMANGLE. In the
  future, we can use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources, avoiding an
  unintended loss of hardening due to missing include files or unlucky header inclusion
  ordering.
  In i386 and x86_64, we can avoid a <tls.h> dependency in the C code by using the computed
  constant from <tcb-offsets.h>. <sysdep.h> no longer includes these definitions, so there
  is no cyclic dependency anymore when computing the <tcb-offsets.h> constants.
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
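  Note: as a rough sketch of what mangling does on ports that implement it (the generic
  <pointer_guard.h> fallback simply makes the macros no-ops; the names below are
  illustrative, not the real definitions):

      #include <stdint.h>

      /* Per-process secret, normally set up early at startup.  */
      static uintptr_t pointer_chk_guard;

      /* Code pointers stored in writable memory (e.g. jmp_buf) are XORed
         with the guard, so an overwritten value is not a directly usable
         address; demangling restores the real pointer before use.  */
      static inline uintptr_t ptr_mangle   (uintptr_t p) { return p ^ pointer_chk_guard; }
      static inline uintptr_t ptr_demangle (uintptr_t p) { return p ^ pointer_chk_guard; }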
* elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support (Adhemerval Zanella, 2022-10-10; 1 file, -0/+24)
  Besides the option being gcc-specific, this approach is still fragile and not future-proof
  since we do not know if this will be the only optimization option gcc will add that
  transforms loops to memset (or any libcall).
  This patch adds a new header, dl-symbol-redir-ifunc.h, that can be used to redirect the
  compiler-generated libcalls to the port's generic memset implementation if required.
  Checked on x86_64-linux-gnu and aarch64-linux-gnu.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
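  Note: one way such a redirection header can work is an asm label on the declaration, so
  that calls the compiler generates to memset bind to a concrete, non-ifunc symbol. This is
  only an illustration; the symbol name below is a placeholder, not the actual header's
  contents:

      #include <stddef.h>

      /* Make compiler-generated memset calls (e.g. from loop distribution)
         resolve to "__memset_generic" instead of the ifunc-selected memset.  */
      extern void *memset (void *, int, size_t) __asm__ ("__memset_generic");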
* Use atomic_exchange_release/acquire (Wilco Dijkstra, 2022-09-26; 1 file, -1/+1)
  Rename atomic_exchange_rel/acq to use atomic_exchange_release/acquire, since these map to
  the standard C11 atomic builtins.
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* csu: Change start code license to have link exception (Szabolcs Nagy, 2022-08-26; 1 file, -4/+21)
  The start code can get linked into dynamically linked executables, where the LGPL would
  require shipping the source or linkable binaries when the executable is distributed.
  On some targets the license exception was missing in start.S (which is compiled into
  crt1.o and Scrt1.o, which may end up linked into PDE and PIE binaries).
  I did not review what other code may end up in executables; this just fixes the start.S
  license inconsistency across targets.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* AArch64: Fix typo in sve configure check (BZ# 29394) (Wilco Dijkstra, 2022-08-11; 2 files, -4/+4)
  Fix a typo in the SVE configure check. This fixes [BZ# 29394].
* arc4random: simplify design for better safety (Jason A. Donenfeld, 2022-07-27; 3 files, -358/+0)
  Rather than buffering 16 MiB of entropy in userspace (by way of chacha20), simply call
  getrandom() every time.
  This approach is doubtlessly slower, for now, but trying to prematurely optimize
  arc4random appears to be leading toward all sorts of nasty properties and gotchas.
  Instead, this patch takes a much more conservative approach. The interface is added as a
  basic loop wrapper around getrandom(), and then later, the kernel and libc can work
  together on optimizing that.
  This prevents numerous issues in which userspace is unaware of when it really must throw
  away its buffer, since we avoid buffering altogether. Future improvements may include
  userspace learning more from the kernel about when to do that, which might make these
  sorts of chacha20-based optimizations more possible. The current heuristic of 16 MiB is
  meaningless garbage that doesn't correspond to anything the kernel might know about. So
  for now, let's just do something conservative that we know is correct and won't lead to
  cryptographic issues for users of this function.
  This patch might be considered along the lines of, "optimization is the root of all evil,"
  in that the much more complex implementation it replaces moves too fast without
  considering security implications, whereas the incremental approach done here is a much
  safer way of going about things. Once this lands, we can take our time in optimizing this
  properly using new interplay between the kernel and userspace.
  getrandom(0) is used, since that's the one that ensures the bytes returned are
  cryptographically secure. But on systems without it, we fall back to using /dev/urandom.
  This is unfortunate because it means opening a file descriptor, but there's not much of a
  choice. Secondly, as part of the fallback, in order to get more or less the same
  properties of getrandom(0), we poll on /dev/random, and if the poll succeeds at least
  once, then we assume the RNG is initialized. This is a rough approximation, as the ancient
  "non-blocking pool" initialized after the "blocking pool", not before, and it may not port
  back to all ancient kernels, though it does to all kernels supported by glibc (≥3.2), so
  generally it's the best approximation we can do.
  The motivation for including arc4random, in the first place, is to have source-level
  compatibility with existing code. That means this patch doesn't attempt to litigate the
  interface itself. It does, however, choose a conservative approach for implementing it.
  Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
  Cc: Florian Weimer <fweimer@redhat.com>
  Cc: Cristian Rodríguez <crrodriguez@opensuse.org>
  Cc: Paul Eggert <eggert@cs.ucla.edu>
  Cc: Mark Harris <mark.hsj@gmail.com>
  Cc: Eric Biggers <ebiggers@kernel.org>
  Cc: linux-crypto@vger.kernel.org
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
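  Note: the "basic loop wrapper around getrandom()" can be sketched as follows (simplified;
  no /dev/urandom fallback or cancellation handling, unlike the real implementation):

      #include <errno.h>
      #include <stddef.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/random.h>

      /* Fill buf with n cryptographically secure bytes, retrying partial
         reads and EINTR.  */
      void
      example_arc4random_buf (void *buf, size_t n)
      {
        unsigned char *p = buf;
        while (n > 0)
          {
            ssize_t l = getrandom (p, n, 0);
            if (l > 0)
              {
                p += l;
                n -= (size_t) l;
              }
            else if (l < 0 && errno != EINTR)
              abort ();   /* the real code would fall back to /dev/urandom */
          }
      }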