path: root/sysdeps/x86_64/multiarch/ifunc-impl-list.c
Commit message | Author | Age | Files | Lines
* Update copyright dates with scripts/update-copyrights | Joseph Myers | 2023-01-06 | 1 | -1/+1
* x86: Add avx2 optimized functions for the wchar_t strcpy family | Noah Goldstein | 2022-11-08 | 1 | -2/+26
    Implemented:
        wcscat-avx2  (+ 744 bytes)
        wcscpy-avx2  (+ 539 bytes)
        wcpcpy-avx2  (+ 577 bytes)
        wcsncpy-avx2 (+1108 bytes)
        wcpncpy-avx2 (+1214 bytes)
        wcsncat-avx2 (+1085 bytes)

    Performance Changes:
        Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation.

        wcscat-avx2  -> 0.975
        wcscpy-avx2  -> 0.591
        wcpcpy-avx2  -> 0.698
        wcsncpy-avx2 -> 0.730
        wcpncpy-avx2 -> 0.711
        wcsncat-avx2 -> 0.954

    Code Size Changes:
        This change increases the size of libc.so by ~5.5kb. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.2kb.

    Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.
* x86: Add evex optimized functions for the wchar_t strcpy family | Noah Goldstein | 2022-11-08 | 1 | -3/+60
    Implemented:
        wcscat-evex  (+ 905 bytes)
        wcscpy-evex  (+ 674 bytes)
        wcpcpy-evex  (+ 709 bytes)
        wcsncpy-evex (+1358 bytes)
        wcpncpy-evex (+1467 bytes)
        wcsncat-evex (+1213 bytes)

    Performance Changes:
        Times are from N = 10 runs of the benchmark suite and are reported as geometric mean of all ratios of New Implementation / Best Old Implementation. Best Old Implementation was determined with the highest ISA implementation.

        wcscat-evex  -> 0.991
        wcscpy-evex  -> 0.587
        wcpcpy-evex  -> 0.695
        wcsncpy-evex -> 0.719
        wcpncpy-evex -> 0.694
        wcsncat-evex -> 0.979

    Code Size Changes:
        This change increases the size of libc.so by ~6.3kb. For reference the patch optimizing the normal strcpy family functions decreases libc.so by ~5.7kb.

    Full check passes on x86-64 and build succeeds for all ISA levels w/ and w/o multiarch.
* x86_64: Implement evex512 version of strrchr and wcsrchr | Sunil K Pandey | 2022-11-03 | 1 | -0/+10
    Changes from v1:
    - Use vec api for register.
    - Replace VPCMP with VPCMPEQ.
    - Restructure and remove 1 unconditional jump.
    - Change page cross logic to use sall.

    This patch implements the following evex512 versions of string functions. The evex512 version takes up to 30% fewer cycles than evex, depending on length and alignment.

    - strrchr function using 512 bit vectors.
    - wcsrchr function using 512 bit vectors.

    Code size data:
        strrchr-evex.o      879 byte
        strrchr-evex512.o   601 byte (-32%)
        wcsrchr-evex.o      882 byte
        wcsrchr-evex512.o   572 byte (-35%)

    Placeholder function, not used by any processor at the moment.

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Implement evex512 version of strchrnul, strchr and wcschr | Sunil K Pandey | 2022-10-25 | 1 | -0/+12
    This patch implements the following evex512 versions of string functions. The evex512 version takes up to 30% fewer cycles than evex, depending on length and alignment.

    - strchrnul function using 512 bit vectors.
    - strchr function using 512 bit vectors.
    - wcschr function using 512 bit vectors.

    Code size data:
        strchrnul-evex.o      599 byte
        strchrnul-evex512.o   569 byte (-5%)
        strchr-evex.o         639 byte
        strchr-evex512.o      595 byte (-7%)
        wcschr-evex.o         644 byte
        wcschr-evex512.o      607 byte (-6%)

    Placeholder function, not used by any processor at the moment.

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Implement evex512 version of memchr, rawmemchr and wmemchr | Sunil K Pandey | 2022-10-18 | 1 | -0/+15
    This patch implements the following evex512 versions of string functions. The evex512 version takes up to 30% fewer cycles than evex, depending on length and alignment.

    - memchr function using 512 bit vectors.
    - rawmemchr function using 512 bit vectors.
    - wmemchr function using 512 bit vectors.

    Code size data:
        memchr-evex.o         762 byte
        memchr-evex512.o      576 byte (-24%)
        rawmemchr-evex.o      461 byte
        rawmemchr-evex512.o   412 byte (-11%)
        wmemchr-evex.o        794 byte
        wmemchr-evex512.o     552 byte (-30%)

    Placeholder function, not used by any processor at the moment.

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations | Aurelien Jarno | 2022-10-03 | 1 | -3/+14
    The AVX2 strrchr and wcsrchr implementations use the 'blsmsk' instruction, which belongs to the BMI1 CPU feature, and the 'shrx' instruction, which belongs to the BMI2 CPU feature.

    Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
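The gating rule this fix enforces can be sketched in portable C. This is a minimal illustration, not glibc's selector: the enum, the function name and the plain boolean parameters are all hypothetical stand-ins for the CPUID-derived feature bits that the real ifunc code consults.

```c
#include <stdbool.h>

/* Hypothetical identifiers for two strrchr variants.  */
enum strrchr_impl_sketch
{
  STRRCHR_SSE2_SKETCH,
  STRRCHR_AVX2_SKETCH
};

/* Pick a variant.  The AVX2 variant is eligible only when every
   extension it executes is present: AVX2 itself, BMI1 (for 'blsmsk')
   and BMI2 (for 'shrx').  Otherwise fall back to the baseline.  */
static enum strrchr_impl_sketch
select_strrchr_sketch (bool has_avx2, bool has_bmi1, bool has_bmi2)
{
  if (has_avx2 && has_bmi1 && has_bmi2)
    return STRRCHR_AVX2_SKETCH;
  return STRRCHR_SSE2_SKETCH;
}
```

Calling `select_strrchr_sketch (true, false, false)` models a CPU (or emulator) that reports AVX2 without the BMI extensions and yields the baseline variant, which is the case the commit is concerned with.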
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation | Aurelien Jarno | 2022-10-03 | 1 | -2/+8
    The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature, and the 'lzcnt' instruction, which belongs to the LZCNT CPU feature.

    Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
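A scalar model may help show why LZCNT has to be checked explicitly. On processors without LZCNT support the same encoding executes as BSR, which returns the index of the highest set bit rather than the count of leading zeros, so the two results differ for almost every input. The function names below are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 32-bit LZCNT: number of leading zero bits, 32 for x == 0.  */
static unsigned
lzcnt32_model (uint32_t x)
{
  unsigned n = 0;
  for (uint32_t bit = UINT32_C (1) << 31; bit != 0 && (x & bit) == 0;
       bit >>= 1)
    n++;
  return n;
}

/* Model of 32-bit BSR: index of the highest set bit.  The hardware
   result is undefined for x == 0, so the model rejects that input.  */
static unsigned
bsr32_model (uint32_t x)
{
  assert (x != 0);
  return 31 - lzcnt32_model (x);
}
```

For `x = 1` the LZCNT model gives 31 while the BSR model gives 0, so code written for LZCNT would compute wrong offsets if it silently ran as BSR.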
* x86-64: Require BMI2 for AVX2 (raw|w)memchr implementations | Aurelien Jarno | 2022-10-03 | 1 | -3/+9
    The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi' and 'sarx' instructions, which belong to the BMI2 CPU feature.

    Fixes: acfd088a1963 ("x86: Optimize memchr-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
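For readers unfamiliar with 'bzhi', here is a scalar model of its 32-bit form: it clears every bit at or above a given index, which is a convenient way to discard match bits beyond a length limit. The function name is hypothetical, and how the AVX2 implementations actually use the result is not shown in this log.

```c
#include <stdint.h>

/* Model of 32-bit BZHI: keep the low N bits of SRC and zero the rest,
   where N is the low byte of INDEX.  When N is 32 or more the source
   is returned unchanged.  */
static uint32_t
bzhi32_model (uint32_t src, uint32_t index)
{
  unsigned n = index & 0xff;
  if (n >= 32)
    return src;
  return src & ((UINT32_C (1) << n) - 1);
}
```

For example, `bzhi32_model (0xabcd, 8)` keeps only the low eight bits and returns `0xcd`.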
* x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations | Aurelien Jarno | 2022-10-03 | 1 | -2/+6
    The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

    NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

    Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 strncmp implementation | Aurelien Jarno | 2022-10-03 | 1 | -2/+5
    The AVX2 strncmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

    NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

    Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for AVX2 strcmp implementation | Aurelien Jarno | 2022-10-03 | 1 | -1/+3
    The AVX2 strcmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

    NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

    Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
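The NB about 'tzcnt' can be checked with a small scalar model. The two instructions differ only for a zero input, where TZCNT returns the operand width and BSF leaves its destination undefined, so code that only feeds them non-zero masks gets the same answer from either. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 32-bit TZCNT: number of trailing zero bits, 32 for x == 0.  */
static unsigned
tzcnt32_model (uint32_t x)
{
  if (x == 0)
    return 32;
  unsigned n = 0;
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Model of 32-bit BSF: index of the lowest set bit.  The hardware
   result is undefined for x == 0, so the model rejects that input.  */
static unsigned
bsf32_model (uint32_t x)
{
  assert (x != 0);
  unsigned n = 0;
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}
```

This is the reason the commits in this series add a BMI2 requirement for 'bzhi' but no BMI1 requirement for 'tzcnt'.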
* x86-64: Require BMI2 for AVX2 str(n)casecmp implementations | Aurelien Jarno | 2022-10-03 | 1 | -8/+20
    The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

    NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

    Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
    Partially resolves: BZ #29611

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Add support to build st{p|r}{n}{cpy|cat} with explicit ISA level | Noah Goldstein | 2022-07-16 | 1 | -72/+111
    1. Add default ISA level selection in non-multiarch/rtld implementations.

    2. Add ISA level build guards to different implementations.
       - I.e. strcpy-avx2.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (strcpy-evex.S).

    3. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86: Add support to build wcscpy with explicit ISA level | Noah Goldstein | 2022-07-16 | 1 | -3/+9
    1. Add ISA level build guards to different implementations.
       - wcscpy-ssse3.S is used as ISA level 2/3/4.
       - wcscpy-generic.c is only used at ISA level 1 and will only build if compiled with ISA level == 1. Otherwise there is no reason to include it as we will always use wcscpy-ssse3.S.

    2. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86: Add support to build strcmp/strlen/strchr with explicit ISA level | Noah Goldstein | 2022-07-16 | 1 | -289/+359
    1. Add default ISA level selection in non-multiarch/rtld implementations.

    2. Add ISA level build guards to different implementations.
       - I.e. strcmp-avx2.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (strcmp-evex.S).

    3. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86: Remove generic strncat, strncpy, and stpncpy implementations | Noah Goldstein | 2022-07-12 | 1 | -6/+3
    These functions all have optimized versions: __strncat_sse2_unaligned, __strncpy_sse2_unaligned, and __stpncpy_sse2_unaligned, which are faster than their respective generic implementations. Since the sse2 versions can run on baseline x86_64, we should use these as the baseline implementation and can remove the generic implementations.

    Geometric mean of N=20 runs of the entire benchmark suite on:
    11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake)

    __strncat_sse2_unaligned / __strncat_generic: .944
    __strncpy_sse2_unaligned / __strncpy_generic: .726
    __stpncpy_sse2_unaligned / __stpncpy_generic: .650

    Tested build with and without multiarch and full check with multiarch.
* x86: Add support for building {w}memcmp{eq} with explicit ISA level | Noah Goldstein | 2022-07-05 | 1 | -51/+66
    1. Refactor files so that all implementations are in the multiarch directory.
       - Moved the implementation portion of memcmp sse2 from memcmp.S to multiarch/memcmp-sse2.S.
       - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds; otherwise we go through the ifunc selector).

    2. Add ISA level build guards to different implementations.
       - I.e. memcmp-avx2-movbe.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memcmp-evex-movbe.S).

    3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the non-multiarch {w}memcmp{eq}.S, which will in turn select the best implementation based on the compiled ISA level.

    4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86: Add support for building {w}memset{_chk} with explicit ISA level | Noah Goldstein | 2022-07-05 | 1 | -120/+129
    1. Refactor files so that all implementations are in the multiarch directory.
       - Moved the implementation portion of memset sse2 from memset.S to multiarch/memset-sse2.S.
       - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds; otherwise we go through the ifunc selector).

    2. Add ISA level build guards to different implementations.
       - I.e. memset-avx2-unaligned-erms.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memset-evex-unaligned-erms.S).

    3. Add new multiarch/rtld-memset.S that just includes the non-multiarch memset.S, which will in turn select the best implementation based on the compiled ISA level.

    4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86: Add support for building {w}memmove{_chk} with explicit ISA level | Noah Goldstein | 2022-07-05 | 1 | -215/+252
    1. Refactor files so that all implementations are in the multiarch directory.
       - Moved the implementation portion of memmove sse2 from memmove.S to multiarch/memmove-sse2.S.
       - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds; otherwise we go through the ifunc selector).

    2. Add ISA level build guards to different implementations.
       - I.e. memmove-avx2-unaligned-erms.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memmove-evex-unaligned-erms.S).

    3. Add new multiarch/rtld-memmove.S that just includes the non-multiarch memmove.S, which will in turn select the best implementation based on the compiled ISA level.

    4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.

    isa raising memmove
* x86: Add support for building str{c|p}{brk|spn} with explicit ISA level | Noah Goldstein | 2022-07-05 | 1 | -0/+6
    The changes for these functions are different than the others because the best implementation (sse4_2) requires the generic implementation as a fallback to be built as well.

    Changes are:

    1. Add non-multiarch functions for str{c|p}{brk|spn}.c to statically select the best implementation based on the configured ISA build level.

    2. Add stubs for str{c|p}{brk|spn}-generic and varshift.c in the sysdeps/x86_64 directory so that the sse4 implementation will have all of its dependencies for the non-multiarch / rtld build when ISA level >= 2.

    3. Add new multiarch/rtld-strcspn.c that just includes the non-multiarch strcspn.c, which will in turn select the best implementation based on the compiled ISA level.

    4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
* x86-64: Properly indent X86_IFUNC_IMPL_ADD_VN arguments | H.J. Lu | 2022-06-29 | 1 | -48/+51
    Properly indent X86_IFUNC_IMPL_ADD_VN arguments for memchr, rawmemchr and wmemchr.

    Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl list | Noah Goldstein | 2022-06-29 | 1 | -0/+4
    This was simply missing and meant we weren't testing it properly.
* x86: Rename strstr_sse2 to strstr_generic as it uses string/strstr.c | Noah Goldstein | 2022-06-27 | 1 | -1/+1
    This is in accordance with other files in the multiarch directory.
* x86: Add support for compiling {raw|w}memchr with high ISA level | Noah Goldstein | 2022-06-22 | 1 | -31/+41
    1. Refactor files so that all implementations are in the multiarch directory.
       - Essentially moved the sse2 {raw|w}memchr.S implementation to multiarch/{raw|w}memchr-sse2.S.
       - The non-multiarch {raw|w}memchr.S file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds; otherwise we go through the ifunc selector).

    2. Add ISA level build guards to different implementations.
       - I.e. memchr-avx2.S, which is ISA level 3, will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memchr-evex{-rtm}.S).

    3. Add new multiarch/rtld-{raw}memchr.S that just include the non-multiarch {raw}memchr.S, which will in turn select the best implementation based on the compiled ISA level.

    4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.
       - Guaranteed replacement essentially means that for any ISA level build there must be a function that the baseline of the ISA supports. So for {raw|w}memchr.S, since there is no ISA level 2 function, the ISA level 2 build still includes the ISA level 1 (sse2) function. Once we reach the ISA level 3 build, however, {raw|w}memchr-avx2{-rtm}.S will always be sufficient, so the ISA level 1 implementation ({raw|w}memchr-sse2.S) will not be built.

    Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}

    And m32 with and without multiarch.
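The "guaranteed replacement" rule can be modelled as a small predicate. This is a sketch of the rule as the message states it, not the wrapper macros glibc uses: the function names and the example table of levels (sse2 = 1, avx2 = 3, evex = 4 for memchr) are hypothetical labels for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Example data: ISA levels of the memchr variants named in the
   message (sse2, avx2, evex).  There is no level 2 variant.  */
static const int memchr_levels_sketch[] = { 1, 3, 4 };

/* The baseline for a build is the highest variant level that does not
   exceed the build's ISA level.  At least one such variant must exist.  */
static int
baseline_level_sketch (const int *impl_levels, size_t n, int build_level)
{
  int best = 0;
  for (size_t i = 0; i < n; i++)
    if (impl_levels[i] <= build_level && impl_levels[i] > best)
      best = impl_levels[i];
  assert (best != 0);
  return best;
}

/* A variant is built when it is at or above the baseline: the baseline
   itself is the guaranteed replacement for everything below it.  */
static bool
should_build_sketch (const int *impl_levels, size_t n, int impl_level,
                     int build_level)
{
  return impl_level >= baseline_level_sketch (impl_levels, n, build_level);
}
```

With this table, a level 2 build still includes the level 1 variant because nothing exists at level 2, while a level 3 build drops it because the avx2 variant always suffices.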
* x86: Rename generic functions with unique postfix for clarity | Noah Goldstein | 2022-06-16 | 1 | -9/+9
    No functions are changed. It just renames generic implementations from '{func}_sse2' to '{func}_generic'. This is just because the postfix "_sse2" was overloaded and was used for files that had hand-optimized sse2 assembly implementations and files that just redirected back to the generic implementation.

    Full xcheck passed on x86_64.
* Add bounds check to __libc_ifunc_impl_list | Wilco Dijkstra | 2022-06-10 | 1 | -7/+2
    Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired.

    Passes buildmanyglibc.

    Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
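The idea of a bounds-checked append can be sketched as follows. This is only an illustration of the technique: the struct, the function names and the entry strings are hypothetical, and the real IFUNC_IMPL_ADD macro is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry type for the sketch.  */
struct impl_entry_sketch
{
  const char *name;
  int usable;
};

/* Append one entry if there is room and return the new count.
   Nothing is written once the count has reached MAX.  */
static size_t
impl_add_sketch (struct impl_entry_sketch *array, size_t i, size_t max,
                 const char *name, int usable)
{
  if (i >= max)
    return i;
  array[i].name = name;
  array[i].usable = usable;
  return i + 1;
}

/* Try to add three entries to an array limited to MAX slots and
   return how many were stored.  */
static size_t
demo_fill_sketch (size_t max)
{
  struct impl_entry_sketch array[4];
  size_t i = 0;
  assert (max <= 4);
  i = impl_add_sketch (array, i, max, "memchr_evex", 1);
  i = impl_add_sketch (array, i, max, "memchr_avx2", 1);
  i = impl_add_sketch (array, i, max, "memchr_sse2", 1);
  return i;
}
```

`demo_fill_sketch (2)` stores two entries and silently drops the third instead of writing past the end of the array.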
* x86_64: Add strstr function with 512-bit EVEX | Raghuveer Devulapalli | 2022-06-06 | 1 | -0/+6
    Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

    (1) We spend a few cycles at the beginning to peek into the needle. We locate an edge in the needle (first occurrence of 2 consecutive distinct characters) and also store the first 64 bytes into a zmm register.

    (2) We search for the edge in the haystack by looking into one cache line of the haystack at a time. This avoids having to read past a page boundary, which can cause a seg fault.

    (3) If an edge is found in the haystack, we first compare the first 64 bytes of the needle (already stored in a zmm register) before we proceed with a full string compare performed byte by byte.

    Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

    Geometric mean of all benchmarks: new / old = 0.66

    Difficult skiptable(0)    : new / old = 0.02
    Difficult skiptable(1)    : new / old = 0.01
    Difficult 2-way           : new / old = 0.25
    Difficult testing first 2 : new / old = 1.26

    Difficult skiptable(0)    : new / old = 0.05
    Difficult skiptable(1)    : new / old = 0.06
    Difficult 2-way           : new / old = 0.26
    Difficult testing first 2 : new / old = 1.05

    Difficult skiptable(0)    : new / old = 0.42
    Difficult skiptable(1)    : new / old = 0.24
    Difficult 2-way           : new / old = 0.21
    Difficult testing first 2 : new / old = 1.04

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
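Step (1) can be illustrated with a scalar helper that finds the "edge" as defined in the message. This is a sketch only: the function name is hypothetical, and returning -1 for a needle with no edge is an assumption, since the message does not say how the real code treats that case.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index I of the first position where NEEDLE[I] and
   NEEDLE[I + 1] are both non-NUL and differ, or -1 if the needle
   has no such pair (empty, one character, or one repeated character).  */
static ptrdiff_t
find_edge_sketch (const char *needle)
{
  assert (needle != NULL);
  for (size_t i = 0; needle[i] != '\0' && needle[i + 1] != '\0'; i++)
    if (needle[i] != needle[i + 1])
      return (ptrdiff_t) i;
  return -1;
}
```

For the needle `"aaab"` the edge is at index 2 (the pair `'a'`, `'b'`); searching the haystack for that two-character pair is more selective than searching for the leading `'a'` alone.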
* x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen | Sunil K Pandey | 2022-05-26 | 1 | -0/+20
    This patch implements the following evex512 versions of string functions. Perf gain for the evex512 version is up to 50% as compared to evex, depending on length and alignment.

    Placeholder function, not used by any processor at the moment.

    - String length function using 512 bit vectors.
    - String N length using 512 bit vectors.
    - Wide string length using 512 bit vectors.
    - Wide string N length using 512 bit vectors.

    Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Remove bzero optimization | Adhemerval Zanella | 2022-05-16 | 1 | -42/+0
    Both symbols are marked as legacy in POSIX.1-2001 and removed in POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE or _DEFAULT_SOURCE.

    GCC also replaces bcopy with a memmove and bzero with memset in the default configuration (to actually get a bzero libc call the code has to omit the string.h inclusion and be built with -fno-builtin), so it is highly unlikely programs are actually calling the libc bzero symbol.

    On a recent Linux distro (Ubuntu 22.04), there are no bzero calls by the installed binaries.

        $ cat count_bstring.sh
        #!/bin/bash

        files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
        total=0
        for file in $files; do
          symbols=`objdump -R $file 2>&1`
          if [ $? -eq 0 ]; then
            ncalls=`echo $symbols | grep -w $1 | wc -l`
            ((total=total+ncalls))
            if [ $ncalls -gt 0 ]; then
              echo "$file: $ncalls"
            fi
          fi
        done
        echo "TOTAL=$total"
        $ ./count_bstring.sh bzero
        TOTAL=0

    Checked on x86_64-linux-gnu.
* x86: Remove memcmp-sse4.S | Noah Goldstein | 2022-04-15 | 1 | -4/+0
    Code didn't actually use any sse4 instructions since `ptest` was removed in:

        commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
        Author: Noah Goldstein <goldstein.w.n@gmail.com>
        Date: Wed Nov 10 16:18:56 2021 -0600

            x86: Shrink memcmp-sse4.S code size

    The new memcmp-sse2 implementation is also faster.

    geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

    Note there are two regressions preferring SSE2 for Size = 1 and Size = 65.

    Size = 1:
        size, align0, align1, ret, New Time/Old Time
           1,      1,      1,   0, 1.2
           1,      1,      1,   1, 1.197
           1,      1,      1,  -1, 1.2

    This is intentional. Size == 1 is significantly less hot based on profiles of GCC11 and Python3 than sizes [4, 8] (which is made hotter).

        Python3 Size = 1      -> 13.64%
        Python3 Size = [4, 8] -> 60.92%

        GCC11 Size = 1        -> 1.29%
        GCC11 Size = [4, 8]   -> 33.86%

        size, align0, align1, ret, New Time/Old Time
           4,      4,      4,   0, 0.622
           4,      4,      4,   1, 0.797
           4,      4,      4,  -1, 0.805
           5,      5,      5,   0, 0.623
           5,      5,      5,   1, 0.777
           5,      5,      5,  -1, 0.802
           6,      6,      6,   0, 0.625
           6,      6,      6,   1, 0.813
           6,      6,      6,  -1, 0.788
           7,      7,      7,   0, 0.625
           7,      7,      7,   1, 0.799
           7,      7,      7,  -1, 0.795
           8,      8,      8,   0, 0.625
           8,      8,      8,   1, 0.848
           8,      8,      8,  -1, 0.914
           9,      9,      9,   0, 0.625

    Size = 65:
        size, align0, align1, ret, New Time/Old Time
          65,      0,      0,   0, 1.103
          65,      0,      0,   1, 1.216
          65,      0,      0,  -1, 1.227
          65,     65,      0,   0, 1.091
          65,      0,     65,   1, 1.19
          65,     65,     65,  -1, 1.215

    This is because A) the checks in range [65, 96] are now unrolled 2x and B) smaller values <= 16 are now given a hotter path. By contrast the SSE4 version has a branch for Size = 80. The unrolled version gets better performance for returns which need both comparisons.

        size, align0, align1, ret, New Time/Old Time
         128,      4,      8,   0, 0.858
         128,      4,      8,   1, 0.879
         128,      4,      8,  -1, 0.888

    As well, outside of microbenchmark environments, which are not fully predictable, the branch will have a real cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove mem{move|cpy}-ssse3-back | Noah Goldstein | 2022-04-14 | 1 | -15/+0
    With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer SSSE3. As a result it is no longer worth it to keep the SSSE3 versions given the code size cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{p}{n}cpy-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -8/+0
    With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer SSSE3. As a result it is no longer worth it to keep the SSSE3 versions given the code size cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{n}cat-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -4/+0
    With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer SSSE3. As a result it is no longer worth it to keep the SSSE3 versions given the code size cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{n}{case}cmp-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -16/+0
    With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer SSSE3. As a result it is no longer worth it to keep the SSSE3 versions given the code size cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove {w}memcmp-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -4/+0
    With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer SSSE3. As a result it is no longer worth it to keep the SSSE3 versions given the code size cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove AVX str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -12/+0
    The rationale is:

    1. SSE42 has nearly identical logic so any benefit is minimal (3.4% regression on Tigerlake using SSE42 versus AVX across the benchtest suite).
    2. The AVX2 version covers the majority of targets that previously preferred it.
    3. The targets where AVX would still be best (SnB and IVB) are becoming outdated.

    All in all, the code size saving is worth it.

    All string/memory tests pass.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add EVEX optimized str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -0/+16
    geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

    All string/memory tests pass.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add AVX2 optimized str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -0/+28
    geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

    All string/memory tests pass.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Optimize bzero | H.J. Lu | 2022-02-08 | 1 | -0/+42
    memset with zero as the value to set is by far the most common case (99%+ for Python3 and GCC).

    bzero can be slightly more optimized for this case by using a zero-idiom xor for broadcasting the set value to a register (vector or GPR).

    Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* Update copyright dates with scripts/update-copyrights | Paul Eggert | 2022-01-01 | 1 | -1/+1
    I used these shell commands:

        ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
        (cd ../glibc && git commit -am"[this commit message]")

    and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO.

    I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not.

        remote: *** 912-#endif
        remote: *** 913:
        remote: *** 914-
        remote: *** error: lines with trailing whitespace found
        ...
        remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
* x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S | Noah Goldstein | 2021-10-27 | 1 | -1/+0
    No bug. This commit adds a new optimized __memcmpeq implementation for evex.

    The primary optimizations are:

    1) skipping the logic to find the difference of the first mismatched byte.

    2) not updating src/dst addresses as the non-equals logic does not need to be reused by different areas.
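The first optimization follows from the weaker contract of an equality-only compare, which a scalar sketch makes visible. The function below is hypothetical and is not the EVEX code; it only shows that the result can be produced without ever locating the first mismatched byte.

```c
#include <stddef.h>

/* Return 0 if the first N bytes of A and B are equal and a non-zero
   value otherwise.  Unlike memcmp, the sign and magnitude of a
   non-zero result carry no meaning, so the differences can simply be
   accumulated instead of located and ordered.  */
static int
memcmpeq_sketch (const void *a, const void *b, size_t n)
{
  const unsigned char *pa = a;
  const unsigned char *pb = b;
  unsigned char diff = 0;
  for (size_t i = 0; i < n; i++)
    diff |= (unsigned char) (pa[i] ^ pb[i]);
  return diff != 0;
}
```

A caller may only test the result against zero, for example `if (memcmpeq_sketch (p, q, len) == 0)`.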
* x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S | Noah Goldstein | 2021-10-27 | 1 | -2/+0
    No bug. This commit adds a new optimized __memcmpeq implementation for avx2.

    The primary optimizations are:

    1) skipping the logic to find the difference of the first mismatched byte.

    2) not updating src/dst addresses as the non-equals logic does not need to be reused by different areas.
* x86_64: Add support for __memcmpeq using sse2, avx2, and evex | Noah Goldstein | 2021-10-27 | 1 | -0/+21
    No bug. This commit adds support for __memcmpeq to be implemented separately from memcmp. Support is added for versions optimized with sse2, avx2, and evex.
* x86: Remove wcsnlen-sse4_1 from wcslen ifunc-impl-list [BZ #28064] | Noah Goldstein | 2021-07-08 | 1 | -2/+2
    The following commit

        commit 6f573a27b6c8b4236445810a44660612323f5a73
        Author: Noah Goldstein <goldstein.w.n@gmail.com>
        Date: Wed Jun 23 01:19:34 2021 -0400

            x86-64: Add wcslen optimize for sse4.1

    added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did not add wcslen-sse4.1 to the wcslen ifunc implementation list. This commit fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc implementation list and adding wcslen-sse4.1 to the ifunc implementation list.

    Testing: test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as well as all other tests in wcsmbs and string.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add wcslen optimize for sse4.1 | Noah Goldstein | 2021-06-23 | 1 | -0/+3
    No bug. This commit adds the ifunc / build infrastructure necessary for wcslen to prefer the sse4.1 implementation in strlen-vec.S. test-wcslen.c is passing.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memcmp-avx2-movbe.S | Noah Goldstein | 2021-05-18 | 1 | -0/+6
    No bug. This commit optimizes memcmp-avx2.S. The optimizations include adding a new vec compare path for small sizes, reorganizing the entry control flow, and removing some unnecessary ALU instructions from the main loop. test-memcmp and test-wmemcmp are both passing.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add EVEX optimized memchr family not safe for RTM | Noah Goldstein | 2021-05-08 | 1 | -0/+15
    No bug.

    This commit adds a new implementation for EVEX memchr that is not safe for RTM because it uses vzeroupper. The benefit is that by using ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop, which is faster than the RTM safe version, which cannot use vpcmpeq because there is no EVEX encoding for the instruction. All parts of the implementation aside from the 4x loop are the same for the two versions and the optimization is only relevant for large sizes.

    Tigerlake:
        size , algn , Pos  , Cur T , New T , Win    , Dif
        512  , 6    , 192  , 9.2   , 9.04  , no-RTM , 0.16
        512  , 7    , 224  , 9.19  , 8.98  , no-RTM , 0.21
        2048 , 0    , 256  , 10.74 , 10.54 , no-RTM , 0.2
        2048 , 0    , 512  , 14.81 , 14.87 , RTM    , 0.06
        2048 , 0    , 1024 , 22.97 , 22.57 , no-RTM , 0.4
        2048 , 0    , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <--

    Icelake:
        size , algn , Pos  , Cur T , New T , Win    , Dif
        512  , 6    , 192  , 7.6   , 7.3   , no-RTM , 0.3
        512  , 7    , 224  , 7.63  , 7.27  , no-RTM , 0.36
        2048 , 0    , 256  , 8.48  , 8.38  , no-RTM , 0.1
        2048 , 0    , 512  , 11.57 , 11.42 , no-RTM , 0.15
        2048 , 0    , 1024 , 17.92 , 17.38 , no-RTM , 0.54
        2048 , 0    , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <--

    test-memchr, test-wmemchr, and test-rawmemchr are all passing.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize strlen-avx2.S | Noah Goldstein | 2021-04-19 | 1 | -4/+12
    No bug. This commit optimizes strlen-avx2.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are a bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S | Noah Goldstein | 2021-04-19 | 1 | -12/+28
    No bug. This commit adds an optimized case for the less_vec memset case that uses the avx512vl/avx512bw mask store, avoiding the excessive branches. test-memset and test-wmemset are passing.

    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
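The effect of a byte-masked store can be modelled in scalar C. This is a sketch under stated assumptions: a 64-byte vector is assumed, the names are hypothetical, and the per-lane loop only emulates what the hardware does in a single masked store instruction, so the branch inside the loop is an artifact of the model rather than of the technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a mask with the low LEN bits set, one bit per byte lane of an
   assumed 64-byte vector.  */
static uint64_t
low_mask_sketch (size_t len)
{
  assert (len <= 64);
  return len == 64 ? UINT64_MAX : (UINT64_C (1) << len) - 1;
}

/* Model of a masked byte store: lanes whose mask bit is clear are left
   untouched.  One mask covers every length below the vector size, so
   no chain of size-class branches is needed to pick a store width.  */
static void
masked_memset_sketch (unsigned char *dst, int c, size_t len)
{
  uint64_t mask = low_mask_sketch (len);
  for (size_t lane = 0; lane < 64; lane++)
    if ((mask >> lane) & 1)
      dst[lane] = (unsigned char) c;
}
```

With `len = 5` the mask is `0x1f`, so only the first five bytes of the destination are written.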