about summary refs log tree commit diff
path: root/sysdeps/x86_64
Commit message (Collapse)AuthorAgeFilesLines
...
* x86: Optimize memchr-evex.S and implement with VMM headersNoah Goldstein2022-10-193-410/+851
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimizations are: 1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the memchr and rawmemchr code essentially incompatible so split rawmemchr-evex to a new file. Code Size Changes: memchr-evex.S : -107 bytes rawmemchr-evex.S : -53 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. memchr-evex.S : 0.928 rawmemchr-evex.S : 0.986 (Less targets cross cache lines) Full results attached in email. Full check passes on x86-64.
* x86_64: Implement evex512 version of memchr, rawmemchr and wmemchrSunil K Pandey2022-10-186-0/+346
| | | | | | | | | | | | | | | | | | | | | | | | | This patch implements following evex512 version of string functions. evex512 version takes up to 30% less cycle as compared to evex, depending on length and alignment. - memchr function using 512 bit vectors. - rawmemchr function using 512 bit vectors. - wmemchr function using 512 bit vectors. Code size data: memchr-evex.o 762 byte memchr-evex512.o 576 byte (-24%) rawmemchr-evex.o 461 byte rawmemchr-evex512.o 412 byte (-11%) wmemchr-evex.o 794 byte wmemchr-evex512.o 552 byte (-30%) Placeholder function, not used by any processor at the moment. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* Use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sourcesFlorian Weimer2022-10-181-2/+0
| | | | | | | | | | | | | | | | | In the future, this will result in a compilation failure if the macros are unexpectedly undefined (due to header inclusion ordering or header inclusion missing altogether). Assembler sources are more difficult to convert. In many cases, they are hand-optimized for the mangling and no-mangling variants, which is why they are not converted. sysdeps/s390/s390-32/__longjmp.c and sysdeps/s390/s390-64/__longjmp.c are special: These are C sources, but most of the implementation is in assembler, so the PTR_DEMANGLE macro has to be undefined in some cases, to match the assembler style. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* Introduce <pointer_guard.h>, extracted from <sysdep.h>Florian Weimer2022-10-183-0/+3
| | | | | | | | | | | | | | This allows us to define a generic no-op version of PTR_MANGLE and PTR_DEMANGLE. In the future, we can use PTR_MANGLE and PTR_DEMANGLE unconditionally in C sources, avoiding an unintended loss of hardening due to missing include files or unlucky header inclusion ordering. In i386 and x86_64, we can avoid a <tls.h> dependency in the C code by using the computed constant from <tcb-offsets.h>. <sysdep.h> no longer includes these definitions, so there is no cyclic dependency anymore when computing the <tcb-offsets.h> constants. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86-64: Move LP_SIZE definition to its own headerFlorian Weimer2022-10-184-11/+48
| | | | | | | | | This way, we can define the pointer guard macros without including <sysdep.h> on x86-64. Other architectures will not have such an inclusion dependency, and the implied header file inclusion would create a porting hazard. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Update strlen-evex-base to use new reg/vec macros.Noah Goldstein2022-10-142-76/+44
| | | | | | | | | | To avoid duplicate the VMM / GPR / mask insn macros in all incoming evex512 files use the macros defined in 'reg-macros.h' and '{vec}-macros.h' This commit does not change libc.so Tested build on x86-64
* x86: Remove now unused vec header macros.Noah Goldstein2022-10-147-328/+0
| | | | | | This commit does not change libc.so Tested build on x86-64
* x86: Update memset to use new VEC macrosNoah Goldstein2022-10-146-99/+43
| | | | | | | | Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64
* x86: Update memmove to use new VEC macrosNoah Goldstein2022-10-146-221/+132
| | | | | | | | Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64
* x86: Update memrchr to use new VEC macrosNoah Goldstein2022-10-141-21/+21
| | | | | | | | Replace %VEC(n) -> %VMM(n) This commit does not change libc.so Tested build on x86-64
* x86: Update VEC macros to complete API for evex/evex512 implsNoah Goldstein2022-10-149-0/+635
| | | | | | | | | | | | | | | | | | | | | | | 1) Copy so that backport will be easier. 2) Make section only define if there is not a previous definition 3) Add `VEC_lo` definition for proper reg-width but in the ymm/zmm0-15 range. 4) Add macros for accessing GPRs based on VEC_SIZE This is to make it easier to do think like: ``` vpcmpb %VEC(0), %VEC(1), %k0 kmov{d|q} %k0, %{eax|rax} test %{eax|rax} ``` It adds macro s.t any GPR can get the proper width with: `V{upcase_GPR_name}` and any mask insn can get the proper width with: `{upcase_mask_insn_without_postfix}` This commit does not change libc.so Tested build on x86-64
* elf: Remove -fno-tree-loop-distribute-patterns usage on dl-supportAdhemerval Zanella2022-10-101-0/+34
| | | | | | | | | | | | | | Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can b used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86_64: Remove platform directory library loading testJavier Pello2022-10-063-64/+0
| | | | | | | | | This was to test loading of shared libraries from platform subdirectories, but this functionality is going away in the following commits. Signed-off-by: Javier Pello <devel@otheo.eu> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Fix -Os build (BZ #29576)Adhemerval Zanella Netto2022-10-051-0/+18
| | | | | | | | | | | The compiler might transform __stpcpy calls (which are routed to __builtin_stpcpy as an optimization) to strcpy and x86_64 strcpy multiarch implementation does not build any working symbol due ISA_SHOULD_BUILD not being evaluated for IS_IN(rtld). Checked on x86_64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementationsAurelien Jarno2022-10-032-3/+15
| | | | | | | | | | | The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk' instruction which belongs to the BMI1 CPU feature and the 'shrx' instruction, which belongs to the BMI2 CPU feature. Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementationAurelien Jarno2022-10-032-2/+9
| | | | | | | | | | | The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which belongs to the LZCNT CPU feature. Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 (raw|w)memchr implementationsAurelien Jarno2022-10-031-3/+9
| | | | | | | | | | The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi' and 'sarx' instructions, which belongs to the BMI2 CPU feature. Fixes: acfd088a1963 ("x86: Optimize memchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 wcs(n)cmp implementationsAurelien Jarno2022-10-031-2/+6
| | | | | | | | | | | | | | The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 strncmp implementationAurelien Jarno2022-10-032-4/+7
| | | | | | | | | | | | | | The AVX2 strncmp implementations uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 strcmp implementationAurelien Jarno2022-10-032-3/+5
| | | | | | | | | | | | | | The AVX2 strcmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 str(n)casecmp implementationsAurelien Jarno2022-10-032-8/+21
| | | | | | | | | | | | | | The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature. NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input. Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Cleanup pthread_spin_{try}lock.SNoah Goldstein2022-10-032-12/+29
| | | | | | | | | | | | | Save a jmp on the lock path coming from an initial failure in pthread_spin_lock.S. This costs 4-bytes of code but since the function still fits in the same number of 16-byte blocks (default function alignment) it does not have affect on the total binary size of libc.so (unchanged after this commit). pthread_spin_trylock was using a CAS when a simple xchg works which is often more expensive. Full check passes on x86-64.
* x86: Remove .tfloat usageAdhemerval Zanella2022-10-031-1/+2
| | | | | Some compiler does not support it (such as clang integrated assembler) neither gcc emits it.
* x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]Noah Goldstein2022-09-281-5/+2
| | | | | | | | | | | Previous implementation was adjusting length (rsi) to match bytes (eax), but since there is no bound to length this can cause overflow. Fix is to just convert the byte-count (eax) to length by dividing by sizeof (wchar_t) before the comparison. Full check passes on x86-64 and build succeeds w/ and w/o multiarch.
* nptl: x86_64: Use same code for CURRENT_STACK_FRAME and stackinfo_get_spAdhemerval Zanella2022-08-311-1/+3
| | | | | | | | | | | | | It avoids the possible warning of uninitialized 'frame' variable when building with clang: ../sysdeps/nptl/jmp-unwind.c:27:42: error: variable 'frame' is uninitialized when used here [-Werror,-Wuninitialized] __pthread_cleanup_upto (env->__jmpbuf, CURRENT_STACK_FRAME); The resulting code is similar to CURRENT_STACK_FRAME. Checked on x86_64-linux-gnu.
* x86: Fix `#define STRCPY` guard in strcpy-sse2.SNoah Goldstein2022-08-091-1/+1
| | | | | | `#ifndef STPCPY` is incorrect for checking if `STRCPY` is already defined. It doesn't end up mattering as the whole check is guarded by `#if IS_IN (libc)` but is incorrect none the less.
* arc4random: simplify design for better safetyJason A. Donenfeld2022-07-274-701/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than buffering 16 MiB of entropy in userspace (by way of chacha20), simply call getrandom() every time. This approach is doubtlessly slower, for now, but trying to prematurely optimize arc4random appears to be leading toward all sorts of nasty properties and gotchas. Instead, this patch takes a much more conservative approach. The interface is added as a basic loop wrapper around getrandom(), and then later, the kernel and libc together can work together on optimizing that. This prevents numerous issues in which userspace is unaware of when it really must throw away its buffer, since we avoid buffering all together. Future improvements may include userspace learning more from the kernel about when to do that, which might make these sorts of chacha20-based optimizations more possible. The current heuristic of 16 MiB is meaningless garbage that doesn't correspond to anything the kernel might know about. So for now, let's just do something conservative that we know is correct and won't lead to cryptographic issues for users of this function. This patch might be considered along the lines of, "optimization is the root of all evil," in that the much more complex implementation it replaces moves too fast without considering security implications, whereas the incremental approach done here is a much safer way of going about things. Once this lands, we can take our time in optimizing this properly using new interplay between the kernel and userspace. getrandom(0) is used, since that's the one that ensures the bytes returned are cryptographically secure. But on systems without it, we fallback to using /dev/urandom. This is unfortunate because it means opening a file descriptor, but there's not much of a choice. Secondly, as part of the fallback, in order to get more or less the same properties of getrandom(0), we poll on /dev/random, and if the poll succeeds at least once, then we assume the RNG is initialized. This is a rough approximation, as the ancient "non-blocking pool" initialized after the "blocking pool", not before, and it may not port back to all ancient kernels, though it does to all kernels supported by glibc (≥3.2), so generally it's the best approximation we can do. The motivation for including arc4random, in the first place, is to have source-level compatibility with existing code. That means this patch doesn't attempt to litigate the interface itself. It does, however, choose a conservative approach for implementing it. Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Cristian Rodríguez <crrodriguez@opensuse.org> Cc: Paul Eggert <eggert@cs.ucla.edu> Cc: Mark Harris <mark.hsj@gmail.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: linux-crypto@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Add AVX2 optimized chacha20Adhemerval Zanella Netto2022-07-224-5/+356
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It adds vectorized ChaCha20 implementation based on libgcrypt cipher/chacha20-amd64-avx2.S. It is used only if AVX2 is supported and enabled by the architecture. As for generic implementation, the last step that XOR with the input is omited. The final state register clearing is also omitted. On a Ryzen 9 5900X it shows the following improvements (using formatted bench-arc4random data): SSE MB/s ----------------------------------------------- arc4random [single-thread] 704.25 arc4random_buf(16) [single-thread] 1018.17 arc4random_buf(32) [single-thread] 1315.27 arc4random_buf(48) [single-thread] 1449.36 arc4random_buf(64) [single-thread] 1511.16 arc4random_buf(80) [single-thread] 1539.48 arc4random_buf(96) [single-thread] 1571.06 arc4random_buf(112) [single-thread] 1596.16 arc4random_buf(128) [single-thread] 1613.48 ----------------------------------------------- AVX2 MB/s ----------------------------------------------- arc4random [single-thread] 922.61 arc4random_buf(16) [single-thread] 1478.70 arc4random_buf(32) [single-thread] 2241.80 arc4random_buf(48) [single-thread] 2681.28 arc4random_buf(64) [single-thread] 2913.43 arc4random_buf(80) [single-thread] 3009.73 arc4random_buf(96) [single-thread] 3141.16 arc4random_buf(112) [single-thread] 3254.46 arc4random_buf(128) [single-thread] 3305.02 ----------------------------------------------- Checked on x86_64-linux-gnu.
* x86: Add SSE2 optimized chacha20Adhemerval Zanella Netto2022-07-223-0/+350
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It adds vectorized ChaCha20 implementation based on libgcrypt cipher/chacha20-amd64-ssse3.S. It replaces the ROTATE_SHUF_2 (which uses pshufb) by ROTATE2 and thus making the original implementation SSE2. As for generic implementation, the last step that XOR with the input is omited. The final state register clearing is also omitted. On a Ryzen 9 5900X it shows the following improvements (using formatted bench-arc4random data): GENERIC MB/s ----------------------------------------------- arc4random [single-thread] 443.11 arc4random_buf(16) [single-thread] 552.27 arc4random_buf(32) [single-thread] 626.86 arc4random_buf(48) [single-thread] 649.81 arc4random_buf(64) [single-thread] 663.95 arc4random_buf(80) [single-thread] 674.78 arc4random_buf(96) [single-thread] 675.17 arc4random_buf(112) [single-thread] 680.69 arc4random_buf(128) [single-thread] 683.20 ----------------------------------------------- SSE MB/s ----------------------------------------------- arc4random [single-thread] 704.25 arc4random_buf(16) [single-thread] 1018.17 arc4random_buf(32) [single-thread] 1315.27 arc4random_buf(48) [single-thread] 1449.36 arc4random_buf(64) [single-thread] 1511.16 arc4random_buf(80) [single-thread] 1539.48 arc4random_buf(96) [single-thread] 1571.06 arc4random_buf(112) [single-thread] 1596.16 arc4random_buf(128) [single-thread] 1613.48 ----------------------------------------------- Checked on x86_64-linux-gnu.
* x86: Add support to build st{p|r}{n}{cpy|cat} with explicit ISA levelNoah Goldstein2022-07-1630-158/+382
| | | | | | | | | | | | | | | | | | | | 1. Add default ISA level selection in non-multiarch/rtld implementations. 2. Add ISA level build guards to different implementations. - I.e strcpy-avx2.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (strcpy-evex.S). 3. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Add support to build wcscpy with explicit ISA levelNoah Goldstein2022-07-168-12/+103
| | | | | | | | | | | | | | | | | | 1. Add ISA level build guards to different implementations. - wcscpy-ssse3.S is used as ISA level 2/3/4. - wcscpy-generic.c is only used at ISA level 1 and will only build if compiled with ISA level == 1. Otherwise there is no reason to include it as we will always use wcscpy-ssse3.S 2. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Add support to build strcmp/strlen/strchr with explicit ISA levelNoah Goldstein2022-07-1687-618/+1147
| | | | | | | | | | | | | | | | | | | | 1. Add default ISA level selection in non-multiarch/rtld implementations. 2. Add ISA level build guards to different implementations. - I.e strcmp-avx2.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (strcmp-evex.S). 3. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Remove unneeded rtld-wmemcmpNoah Goldstein2022-07-131-18/+0
| | | | | | | | | | wmemcmp isn't used by the dynamic loader so their no need to add an RTLD stub for it. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Move wcslen SSE2 implementation to multiarch/wcslen-sse2.SNoah Goldstein2022-07-132-219/+218
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move wcschr SSE2 implementation to multiarch/wcschr-sse2.SNoah Goldstein2022-07-132-142/+138
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strcat SSE2 implementation to multiarch/strcat-sse2.SNoah Goldstein2022-07-132-243/+238
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strchr SSE2 implementation to multiarch/strchr-sse2.SNoah Goldstein2022-07-136-183/+213
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strrchr SSE2 implementation to multiarch/strrchr-sse2.SNoah Goldstein2022-07-134-377/+366
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move memrchr SSE2 implementation to multiarch/memrchr-sse2.SNoah Goldstein2022-07-132-334/+334
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strcpy SSE2 implementation to multiarch/strcpy-sse2.SNoah Goldstein2022-07-135-155/+156
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strlen SSE2 implementation to multiarch/strlen-sse2.SNoah Goldstein2022-07-139-286/+306
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strcmp SSE42 implementation to multiarch/strcmp-sse4_2.SNoah Goldstein2022-07-135-1792/+1766
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move wcscmp SSE2 implementation to multiarch/wcscmp-sse2.SNoah Goldstein2022-07-132-934/+934
| | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Move strcmp SSE2 implementation to multiarch/strcmp-sse2.SNoah Goldstein2022-07-1311-2178/+2264
| | | | | | | | | | | | This commit doesn't affect libc.so.6, its just housekeeping to prepare for adding explicit ISA level support. Because strcmp-sse2.S implements so many functions (more from avx2/evex/sse42) add a new file 'strcmp-naming.h' to assist in getting the correct symbol name for all the function across multiarch/non-multiarch builds. Tested build on x86_64 and x86_32 with/without multiarch.
* x86: Rename STRCASECMP_NONASCII macro to STRCASECMP_L_NONASCIINoah Goldstein2022-07-132-6/+6
| | | | | | The previous macro name can be confusing given that both `__strcasecmp_l_nonascii` and `__strcasecmp_nonascii` are functions and we use the `_l` version.
* x86: Remove __mmask intrinsics in strstr-avx512.cNoah Goldstein2022-07-121-6/+10
| | | | | | | | | | | | The intrinsics are not available before GCC7 and using standard operators generates code of equivalent or better quality. Removed: _cvtmask64_u64 _kshiftri_mask64 _kand_mask64 Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958
* x86: Remove generic strncat, strncpy, and stpncpy implementationsNoah Goldstein2022-07-1210-92/+56
| | | | | | | | | | | | | | | | | | These functions all have optimized versions: __strncat_sse2_unaligned, __strncpy_sse2_unaligned, and stpncpy_sse2_unaligned which are faster than their respective generic implementations. Since the sse2 versions can run on baseline x86_64, we should use these as the baseline implementation and can remove the generic implementations. Geometric mean of N=20 runs of the entire benchmark suite on: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake) __strncat_sse2_unaligned / __strncat_generic: .944 __strncpy_sse2_unaligned / __strncpy_generic: .726 __stpncpy_sse2_unaligned / __stpncpy_generic: .650 Tested build with and without multiarch and full check with multiarch.
* x86-64: Remove redundant strcspn-generic/strpbrk-generic/strspn-genericH.J. Lu2022-07-081-3/+0
| | | | | | | | | | | | | Remove redundant strcspn-generic, strpbrk-generic and strspn-generic from sysdep_routines in sysdeps/x86_64/multiarch/Makefile added by commit c69f960b017b2cdf39335739009526a72fb20379 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Sun Jul 3 21:28:07 2022 -0700 x86: Add support for building str{c|p}{brk|spn} with explicit ISA level since they have been added to sysdep_routines in sysdeps/x86_64/Makefile.
* x86-64: Don't mark symbols as hidden in strcmp-XXX.SH.J. Lu2022-07-073-3/+0
| | | | | Don't mark symbols as hidden in strcmp-avx2.S, strcmp-evex.S and strcmp-sse42.S since they are marked as hidden in the IFUNC selectors.
* x86: Add support for building {w}memcmp{eq} with explicit ISA levelNoah Goldstein2022-07-0519-660/+788
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Refactor files so that all implementations are in the multiarch directory - Moved the implementation portion of memcmp sse2 from memcmp.S to multiarch/memcmp-sse2.S - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e memcmp-avx2-movsb.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memcmp-evex-movbe.S). 3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the non-multiarch {w}memcmp{eq}.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.