path: root/sysdeps/x86_64/multiarch/Makefile
Commit message | Author | Age | Files | Lines
* x86_64: Remove avx512 strstr implementation | Adhemerval Zanella | 2024-03-27 | 1 | -3/+0
| As indicated in a recent thread, this is a simple brute-force algorithm
| that checks the whole needle at a matching character pair (and does so
| 1 byte at a time after the first 64 bytes of a needle). It also never
| skips ahead, so it can attempt a match at every haystack position after
| trying to match all of the needle, which the generic implementation
| avoids.
|
| As indicated by Wilco, a 4x larger needle and 16x larger haystack give
| a clear 65x slowdown for both basic_strstr and __strstr_avx512:
|
|   "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512",
|              "__strstr_sse2_unaligned", "__strstr_generic"],
|
|   {
|    "len_haystack": 65536,
|    "len_needle": 1024,
|    "align_haystack": 0,
|    "align_needle": 0,
|    "fail": 1,
|    "desc": "Difficult bruteforce needle",
|    "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2]
|   },
|   {
|    "len_haystack": 1048576,
|    "len_needle": 4096,
|    "align_haystack": 0,
|    "align_needle": 0,
|    "fail": 1,
|    "desc": "Difficult bruteforce needle",
|    "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9]
|   }
|
| PS: I don't have an AVX512-capable machine to verify this issue, but
| skimming through the code, it does seem to follow what Wilco has
| described.
|
| Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
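The cost pattern described in this commit message can be illustrated with a minimal sketch. This is not the removed assembly: `brute_force_find` and `difficult_case` are hypothetical helpers, and the input shape (a haystack of one repeated byte, a needle that differs only in its last byte) is an assumed stand-in for the "Difficult bruteforce needle" benchmark case. The sketch counts byte comparisons for a search that restarts at every haystack position and never skips ahead.

```python
def brute_force_find(haystack: bytes, needle: bytes):
    """Naive search. Returns (index, comparisons); index is -1 if absent."""
    comparisons = 0
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        matched = 0
        # Compare the needle byte by byte at this position; never skip ahead.
        while matched < m:
            comparisons += 1
            if haystack[start + matched] != needle[matched]:
                break
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


def difficult_case(len_haystack: int, len_needle: int):
    """Hypothetical worst case: every position matches until the last byte."""
    haystack = b"a" * len_haystack
    needle = b"a" * (len_needle - 1) + b"b"
    return haystack, needle


if __name__ == "__main__":
    # 4x larger needle and 16x larger haystack, as in the commit message,
    # but scaled down so the pure-Python loop finishes quickly.
    for len_haystack, len_needle in [(1024, 16), (16384, 64)]:
        haystack, needle = difficult_case(len_haystack, len_needle)
        index, comparisons = brute_force_find(haystack, needle)
        print(len_haystack, len_needle, index, comparisons)
```

On such inputs the comparison count is (len_haystack - len_needle + 1) * len_needle, so scaling the needle by 4 and the haystack by 16 multiplies the work by roughly 64, which is in line with the slowdown reported above.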
* Apply the Makefile sorting fix | H.J. Lu | 2024-02-15 | 1 | -2/+2
| Apply the Makefile sorting fix generated by sort-makefile-lines.py.
* x86: Add avx2 optimized functions for the wchar_t strcpy family | Noah Goldstein | 2022-11-08 | 1 | -0/+6
| Implemented:
|     wcscat-avx2  (+ 744 bytes)
|     wcscpy-avx2  (+ 539 bytes)
|     wcpcpy-avx2  (+ 577 bytes)
|     wcsncpy-avx2 (+1108 bytes)
|     wcpncpy-avx2 (+1214 bytes)
|     wcsncat-avx2 (+1085 bytes)
|
| Performance Changes:
|     Times are from N = 10 runs of the benchmark suite and are reported
|     as geometric mean of all ratios of New Implementation / Best Old
|     Implementation. Best Old Implementation was determined with the
|     highest ISA implementation.
|
|     wcscat-avx2  -> 0.975
|     wcscpy-avx2  -> 0.591
|     wcpcpy-avx2  -> 0.698
|     wcsncpy-avx2 -> 0.730
|     wcpncpy-avx2 -> 0.711
|     wcsncat-avx2 -> 0.954
|
| Code Size Changes:
|     This change increases the size of libc.so by ~5.5kb. For reference
|     the patch optimizing the normal strcpy family functions decreases
|     libc.so by ~5.2kb.
|
| Full check passes on x86-64 and build succeeds for all ISA levels w/
| and w/o multiarch.
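The "geometric mean of all ratios of New Implementation / Best Old Implementation" metric used in these commit messages can be sketched as follows. This is only an illustration of the arithmetic: `geometric_mean_of_ratios` is a hypothetical helper, the timings are made-up numbers, and the glibc benchmark scripts themselves are not reproduced here.

```python
import math


def geometric_mean_of_ratios(new_times, old_times):
    """Geometric mean of new/old timing ratios; values below 1.0 mean faster."""
    if len(new_times) != len(old_times) or not new_times:
        raise ValueError("need two non-empty sequences of equal length")
    # Average the logarithms so that no single ratio dominates the result.
    log_sum = sum(math.log(new / old) for new, old in zip(new_times, old_times))
    return math.exp(log_sum / len(new_times))


if __name__ == "__main__":
    new_times = [50.0, 120.0, 90.0]   # made-up per-benchmark timings
    old_times = [100.0, 150.0, 90.0]  # made-up baseline timings
    print(round(geometric_mean_of_ratios(new_times, old_times), 3))
```

A geometric mean is the usual choice for ratios because a 2x speedup and a 2x slowdown cancel out to 1.0, which an arithmetic mean of the same ratios would not do.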
* x86: Add evex optimized functions for the wchar_t strcpy family | Noah Goldstein | 2022-11-08 | 1 | -1/+13
| Implemented:
|     wcscat-evex  (+ 905 bytes)
|     wcscpy-evex  (+ 674 bytes)
|     wcpcpy-evex  (+ 709 bytes)
|     wcsncpy-evex (+1358 bytes)
|     wcpncpy-evex (+1467 bytes)
|     wcsncat-evex (+1213 bytes)
|
| Performance Changes:
|     Times are from N = 10 runs of the benchmark suite and are reported
|     as geometric mean of all ratios of New Implementation / Best Old
|     Implementation. Best Old Implementation was determined with the
|     highest ISA implementation.
|
|     wcscat-evex  -> 0.991
|     wcscpy-evex  -> 0.587
|     wcpcpy-evex  -> 0.695
|     wcsncpy-evex -> 0.719
|     wcpncpy-evex -> 0.694
|     wcsncat-evex -> 0.979
|
| Code Size Changes:
|     This change increases the size of libc.so by ~6.3kb. For reference
|     the patch optimizing the normal strcpy family functions decreases
|     libc.so by ~5.7kb.
|
| Full check passes on x86-64 and build succeeds for all ISA levels w/
| and w/o multiarch.
* x86_64: Implement evex512 version of strrchr and wcsrchr | Sunil K Pandey | 2022-11-03 | 1 | -0/+2
| Changes from v1:
|   Use vec api for register.
|   Replace VPCMP with VPCMPEQ.
|   Restructure and remove 1 unconditional jump.
|   Change page cross logic to use sall.
|
| This patch implements the following evex512 versions of string
| functions. The evex512 version takes up to 30% fewer cycles than evex,
| depending on length and alignment.
|
| - strrchr function using 512 bit vectors.
| - wcsrchr function using 512 bit vectors.
|
| Code size data:
|
| strrchr-evex.o       879 byte
| strrchr-evex512.o    601 byte (-32%)
|
| wcsrchr-evex.o       882 byte
| wcsrchr-evex512.o    572 byte (-35%)
|
| Placeholder function, not used by any processor at the moment.
|
| Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Implement evex512 version of strchrnul, strchr and wcschr | Sunil K Pandey | 2022-10-25 | 1 | -0/+3
| This patch implements the following evex512 versions of string
| functions. The evex512 version takes up to 30% fewer cycles than evex,
| depending on length and alignment.
|
| - strchrnul function using 512 bit vectors.
| - strchr function using 512 bit vectors.
| - wcschr function using 512 bit vectors.
|
| Code size data:
|
| strchrnul-evex.o       599 byte
| strchrnul-evex512.o    569 byte (-5%)
|
| strchr-evex.o          639 byte
| strchr-evex512.o       595 byte (-7%)
|
| wcschr-evex.o          644 byte
| wcschr-evex512.o       607 byte (-6%)
|
| Placeholder function, not used by any processor at the moment.
|
| Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Implement evex512 version of memchr, rawmemchr and wmemchr | Sunil K Pandey | 2022-10-18 | 1 | -0/+3
| This patch implements the following evex512 versions of string
| functions. The evex512 version takes up to 30% fewer cycles than evex,
| depending on length and alignment.
|
| - memchr function using 512 bit vectors.
| - rawmemchr function using 512 bit vectors.
| - wmemchr function using 512 bit vectors.
|
| Code size data:
|
| memchr-evex.o          762 byte
| memchr-evex512.o       576 byte (-24%)
|
| rawmemchr-evex.o       461 byte
| rawmemchr-evex512.o    412 byte (-11%)
|
| wmemchr-evex.o         794 byte
| wmemchr-evex512.o      552 byte (-30%)
|
| Placeholder function, not used by any processor at the moment.
|
| Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Add support to build wcscpy with explicit ISA level | Noah Goldstein | 2022-07-16 | 1 | -1/+0
| 1. Add ISA level build guards to different implementations.
|     - wcscpy-ssse3.S is used as ISA level 2/3/4.
|     - wcscpy-generic.c is only used at ISA level 1 and will
|       only build if compiled with ISA level == 1. Otherwise
|       there is no reason to include it as we will always use
|       wcscpy-ssse3.S
|
| 2. Refactor the ifunc selector and ifunc implementation list to use
|    the ISA level aware wrapper macros that allow functions below the
|    compiled ISA level (with a guaranteed replacement) to be skipped.
|
| Tested with and without multiarch on x86_64 for ISA levels:
| {generic, x86-64-v2, x86-64-v3, x86-64-v4}
|
| And m32 with and without multiarch.
* x86: Add support to build strcmp/strlen/strchr with explicit ISA level | Noah Goldstein | 2022-07-16 | 1 | -2/+0
| 1. Add default ISA level selection in non-multiarch/rtld
|    implementations.
|
| 2. Add ISA level build guards to different implementations.
|     - I.e. strcmp-avx2.S, which is ISA level 3, will only build if
|       compiled ISA level <= 3. Otherwise there is no reason to
|       include it as we will always use one of the ISA level 4
|       implementations (strcmp-evex.S).
|
| 3. Refactor the ifunc selector and ifunc implementation list to use
|    the ISA level aware wrapper macros that allow functions below the
|    compiled ISA level (with a guaranteed replacement) to be skipped.
|
| Tested with and without multiarch on x86_64 for ISA levels:
| {generic, x86-64-v2, x86-64-v3, x86-64-v4}
|
| And m32 with and without multiarch.
* x86: Remove generic strncat, strncpy, and stpncpy implementations | Noah Goldstein | 2022-07-12 | 1 | -3/+0
| These functions all have optimized versions:
| __strncat_sse2_unaligned, __strncpy_sse2_unaligned, and
| __stpncpy_sse2_unaligned which are faster than their respective generic
| implementations. Since the sse2 versions can run on baseline x86_64,
| we should use these as the baseline implementation and can remove the
| generic implementations.
|
| Geometric mean of N=20 runs of the entire benchmark suite on:
| 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake)
|
| __strncat_sse2_unaligned / __strncat_generic: .944
| __strncpy_sse2_unaligned / __strncpy_generic: .726
| __stpncpy_sse2_unaligned / __stpncpy_generic: .650
|
| Tested build with and without multiarch and full check with
| multiarch.
* x86-64: Remove redundant strcspn-generic/strpbrk-generic/strspn-generic | H.J. Lu | 2022-07-08 | 1 | -3/+0
| Remove redundant strcspn-generic, strpbrk-generic and strspn-generic
| from sysdep_routines in sysdeps/x86_64/multiarch/Makefile added by
|
| commit c69f960b017b2cdf39335739009526a72fb20379
| Author: Noah Goldstein <goldstein.w.n@gmail.com>
| Date:   Sun Jul 3 21:28:07 2022 -0700
|
|     x86: Add support for building str{c|p}{brk|spn} with explicit ISA level
|
| since they have been added to sysdep_routines in
| sysdeps/x86_64/Makefile.
* x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file | Noah Goldstein | 2022-06-29 | 1 | -0/+1
| The primary memmove_{impl}_unaligned_erms implementations don't
| interact with this function. Putting them in the same file both
| wastes space and unnecessarily bloats a hot code section.
* x86: Move and slightly improve memset_erms | Noah Goldstein | 2022-06-29 | 1 | -0/+1
| Implementation wise:
|     1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
|        use the L(stosb) label that was previously defined.
|
|     2. Don't give the hotpath (fallthrough) to zero size.
|
| Code positioning wise:
|
| Move memset_{chk}_erms to its own file. Leaving it in between the
| memset_{impl}_unaligned both adds unnecessary complexity to the
| file and wastes space in a relatively hot cache section.
* x86: Rename generic functions with unique postfix for clarity | Noah Goldstein | 2022-06-16 | 1 | -16/+17
| No functions are changed. It just renames generic implementations from
| '{func}_sse2' to '{func}_generic'. This is just because the postfix
| "_sse2" was overloaded and was used for files that had hand-optimized
| sse2 assembly implementations and files that just redirected back
| to the generic implementation.
|
| Full xcheck passed on x86_64.
* x86_64: Add strstr function with 512-bit EVEX | Raghuveer Devulapalli | 2022-06-06 | 1 | -0/+2
| Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
|
| (1) We spend a few cycles at the beginning to peek into the needle. We
| locate an edge in the needle (first occurrence of 2 consecutive distinct
| characters) and also store the first 64 bytes into a zmm register.
|
| (2) We search for the edge in the haystack by looking into one cache
| line of the haystack at a time. This avoids having to read past a page
| boundary, which can cause a seg fault.
|
| (3) If an edge is found in the haystack we first compare the first
| 64 bytes of the needle (already stored in a zmm register) before we
| proceed with a full string compare performed byte by byte.
|
| Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
|
| Geometric mean of all benchmarks: new / old = 0.66
|
| Difficult skiptable(0)    : new / old = 0.02
| Difficult skiptable(1)    : new / old = 0.01
| Difficult 2-way           : new / old = 0.25
| Difficult testing first 2 : new / old = 1.26
| Difficult skiptable(0)    : new / old = 0.05
| Difficult skiptable(1)    : new / old = 0.06
| Difficult 2-way           : new / old = 0.26
| Difficult testing first 2 : new / old = 1.05
| Difficult skiptable(0)    : new / old = 0.42
| Difficult skiptable(1)    : new / old = 0.24
| Difficult 2-way           : new / old = 0.21
| Difficult testing first 2 : new / old = 1.04
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
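The three steps in this commit message can be sketched in scalar form. This is a rough illustration only, not the AVX-512 implementation: `find_edge` and `edge_strstr` are hypothetical names, the search proceeds one byte at a time instead of one cache line at a time, the 64-byte prefix compare is folded into a single slice comparison, and the fallback for needles without an edge is an assumption.

```python
def find_edge(needle: bytes):
    """Index i of the first pair with needle[i] != needle[i + 1], else None."""
    for i in range(len(needle) - 1):
        if needle[i] != needle[i + 1]:
            return i
    return None


def edge_strstr(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    if not needle:
        return 0
    edge = find_edge(needle)
    if edge is None:
        # Assumption: a needle made of one repeated byte has no edge,
        # so this sketch falls back to a plain search.
        return haystack.find(needle)
    first, second = needle[edge], needle[edge + 1]
    # Last haystack position where the edge pair can sit while the whole
    # needle still fits inside the haystack.
    last_pos = len(haystack) - len(needle) + edge
    for pos in range(edge, last_pos + 1):
        # Step 2: look for the edge pair in the haystack.
        if haystack[pos] == first and haystack[pos + 1] == second:
            start = pos - edge
            # Step 3: verify the whole needle at the candidate position.
            if haystack[start:start + len(needle)] == needle:
                return start
    return -1


if __name__ == "__main__":
    print(edge_strstr(b"aaabcaab", b"aab"))
```

The full compare in step 3 runs at every edge hit and nothing is skipped afterwards, which is the behaviour the later removal commit describes as brute force.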
* x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen | Sunil K Pandey | 2022-05-26 | 1 | -0/+4
| This patch implements the following evex512 versions of string
| functions. Perf gain for the evex512 version is up to 50% as compared
| to evex, depending on length and alignment.
|
| Placeholder function, not used by any processor at the moment.
|
| - String length function using 512 bit vectors.
| - String N length using 512 bit vectors.
| - Wide string length using 512 bit vectors.
| - Wide string N length using 512 bit vectors.
|
| Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Remove bzero optimization | Adhemerval Zanella | 2022-05-16 | 1 | -1/+0
| Both symbols are marked as legacy in POSIX.1-2001 and removed in
| POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
| or _DEFAULT_SOURCE.
|
| GCC also replaces bcopy with a memmove and bzero with memset on default
| configuration (to actually get a bzero libc call the code must omit
| the string.h inclusion and be built with -fno-builtin), so it is
| highly unlikely programs are actually calling the libc bzero symbol.
|
| On a recent Linux distro (Ubuntu 22.04), there are no bzero calls
| by the installed binaries.
|
|   $ cat count_bstring.sh
|   #!/bin/bash
|
|   files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
|   total=0
|   for file in $files; do
|     symbols=`objdump -R $file 2>&1`
|     if [ $? -eq 0 ]; then
|       ncalls=`echo $symbols | grep -w $1 | wc -l`
|       ((total=total+ncalls))
|       if [ $ncalls -gt 0 ]; then
|         echo "$file: $ncalls"
|       fi
|     fi
|   done
|   echo "TOTAL=$total"
|   $ ./count_bstring.sh bzero
|   TOTAL=0
|
| Checked on x86_64-linux-gnu.
* x86: Remove memcmp-sse4.S | Noah Goldstein | 2022-04-15 | 1 | -2/+0
| Code didn't actually use any sse4 instructions since `ptest` was
| removed in:
|
| commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
| Author: Noah Goldstein <goldstein.w.n@gmail.com>
| Date:   Wed Nov 10 16:18:56 2021 -0600
|
|     x86: Shrink memcmp-sse4.S code size
|
| The new memcmp-sse2 implementation is also faster.
|
| geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905
|
| Note there are two regressions preferring SSE2 for Size = 1 and
| Size = 65.
|
| Size = 1:
| size, align0, align1, ret, New Time/Old Time
|    1,      1,      1,   0,               1.2
|    1,      1,      1,   1,             1.197
|    1,      1,      1,  -1,               1.2
|
| This is intentional. Size == 1 is significantly less hot based on
| profiles of GCC11 and Python3 than sizes [4, 8] (which are made
| hotter).
|
| Python3 Size = 1        -> 13.64%
| Python3 Size = [4, 8]   -> 60.92%
|
| GCC11   Size = 1        ->  1.29%
| GCC11   Size = [4, 8]   -> 33.86%
|
| size, align0, align1, ret, New Time/Old Time
|    4,      4,      4,   0,             0.622
|    4,      4,      4,   1,             0.797
|    4,      4,      4,  -1,             0.805
|    5,      5,      5,   0,             0.623
|    5,      5,      5,   1,             0.777
|    5,      5,      5,  -1,             0.802
|    6,      6,      6,   0,             0.625
|    6,      6,      6,   1,             0.813
|    6,      6,      6,  -1,             0.788
|    7,      7,      7,   0,             0.625
|    7,      7,      7,   1,             0.799
|    7,      7,      7,  -1,             0.795
|    8,      8,      8,   0,             0.625
|    8,      8,      8,   1,             0.848
|    8,      8,      8,  -1,             0.914
|    9,      9,      9,   0,             0.625
|
| Size = 65:
| size, align0, align1, ret, New Time/Old Time
|   65,      0,      0,   0,             1.103
|   65,      0,      0,   1,             1.216
|   65,      0,      0,  -1,             1.227
|   65,     65,      0,   0,             1.091
|   65,      0,     65,   1,              1.19
|   65,     65,     65,  -1,             1.215
|
| This is because A) the checks in range [65, 96] are now unrolled 2x
| and B) because smaller values <= 16 are now given a hotter path. By
| contrast the SSE4 version has a branch for Size = 80. The unrolled
| version gets better performance for returns which need both
| comparisons.
|
| size, align0, align1, ret, New Time/Old Time
|  128,      4,      8,   0,             0.858
|  128,      4,      8,   1,             0.879
|  128,      4,      8,  -1,             0.888
|
| As well, outside of microbenchmark environments, which are not fully
| predictable, the branch will have a real cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memcmp SSE2 in memcmp.S | Noah Goldstein | 2022-04-15 | 1 | -1/+1
| New code saves size (-303 bytes) and has significantly better
| performance.
|
| geometric_mean(N=20) of page cross cases New / Original: 0.634
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Reduce code size of mem{move|pcpy|cpy}-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -1/+0
| The goal is to remove most SSSE3 functions as SSE4, AVX2, and EVEX are
| generally preferable. memcpy/memmove is one exception where avoiding
| unaligned loads with `palignr` is important for some targets.
|
| This commit replaces memmove-ssse3 with a better optimized and lower
| code footprint version. As well it aliases memcpy to memmove.
|
| Aside from this function all other SSSE3 functions should be safe to
| remove.
|
| The performance is not changed drastically, although it shows overall
| improvements without any major regressions or gains.
|
| bench-memcpy geometric_mean(N=50) New / Original: 0.957
|
| bench-memcpy-random geometric_mean(N=50) New / Original: 0.912
|
| bench-memcpy-large geometric_mean(N=50) New / Original: 0.892
|
| Benchmarks were run on Zhaoxin KX-6840@2000MHz. See attached numbers
| for all results.
|
| More importantly, this saves 7246 bytes of code size in memmove and an
| additional 10741 bytes by reusing memmove code for memcpy (total 17987
| bytes saved), as well as an additional 896 bytes of rodata for the
| jump table entries.
* x86: Remove mem{move|cpy}-ssse3-back | Noah Goldstein | 2022-04-14 | 1 | -2/+0
| With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
| SSSE3. As a result it is no longer worth it to keep the SSSE3
| versions given the code size cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{p}{n}cpy-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -4/+0
| With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
| SSSE3. As a result it is no longer worth it to keep the SSSE3
| versions given the code size cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{n}cat-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -2/+0
| With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
| SSSE3. As a result it is no longer worth it to keep the SSSE3
| versions given the code size cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove str{n}{case}cmp-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -4/+0
| With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
| SSSE3. As a result it is no longer worth it to keep the SSSE3
| versions given the code size cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove {w}memcmp-ssse3 | Noah Goldstein | 2022-04-14 | 1 | -2/+0
| With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
| SSSE3. As a result it is no longer worth it to keep the SSSE3
| versions given the code size cost.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove AVX str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -2/+0
| The rationale is:
|
| 1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
|    regression on Tigerlake using SSE42 versus AVX across the
|    benchtest suite).
| 2. AVX2 version covers the majority of targets that previously
|    preferred it.
| 3. The targets where AVX would still be best (SnB and IVB) are
|    becoming outdated.
|
| All in all, saving the code size is worth it.
|
| All string/memory tests pass.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add EVEX optimized str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -0/+2
| geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621
|
| All string/memory tests pass.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add AVX2 optimized str{n}casecmp | Noah Goldstein | 2022-03-25 | 1 | -0/+4
| geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702
|
| All string/memory tests pass.
|
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Optimize bzero | H.J. Lu | 2022-02-08 | 1 | -0/+1
| memset with zero as the value to set is by far the majority value (99%+
| for Python3 and GCC).
|
| bzero can be slightly more optimized for this case by using a zero-idiom
| xor for broadcasting the set value to a register (vector or GPR).
|
| Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64/multiarch: Sort sysdep_routines and put one entry per line | H.J. Lu | 2022-02-05 | 1 | -114/+176
|
* x86_64: Add support for __memcmpeq using sse2, avx2, and evex | Noah Goldstein | 2021-10-27 | 1 | -0/+4
| No bug. This commit adds support for __memcmpeq to be implemented
| separately from memcmp. Support is added for versions optimized with
| sse2, avx2, and evex.
* x86-64: Add wcslen optimize for sse4.1 | Noah Goldstein | 2021-06-23 | 1 | -2/+2
| No bug. This commit adds the ifunc / build infrastructure
| necessary for wcslen to prefer the sse4.1 implementation
| in strlen-vec.S. test-wcslen.c is passing.
|
| Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add EVEX optimized memchr family not safe for RTM | Noah Goldstein | 2021-05-08 | 1 | -2/+5
| No bug.
|
| This commit adds a new implementation for EVEX memchr that is not safe
| for RTM because it uses vzeroupper. The benefit is that by using
| ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is
| faster than the RTM safe version which cannot use vpcmpeq because
| there is no EVEX encoding for the instruction. All parts of the
| implementation aside from the 4x loop are the same for the two
| versions and the optimization is only relevant for large sizes.
|
| Tigerlake:
| size , algn , Pos  , Cur T , New T , Win    , Dif
| 512  , 6    , 192  , 9.2   , 9.04  , no-RTM , 0.16
| 512  , 7    , 224  , 9.19  , 8.98  , no-RTM , 0.21
| 2048 , 0    , 256  , 10.74 , 10.54 , no-RTM , 0.2
| 2048 , 0    , 512  , 14.81 , 14.87 , RTM    , 0.06
| 2048 , 0    , 1024 , 22.97 , 22.57 , no-RTM , 0.4
| 2048 , 0    , 2048 , 37.49 , 34.51 , no-RTM , 2.98   <--
|
| Icelake:
| size , algn , Pos  , Cur T , New T , Win    , Dif
| 512  , 6    , 192  , 7.6   , 7.3   , no-RTM , 0.3
| 512  , 7    , 224  , 7.63  , 7.27  , no-RTM , 0.36
| 2048 , 0    , 256  , 8.48  , 8.38  , no-RTM , 0.1
| 2048 , 0    , 512  , 11.57 , 11.42 , no-RTM , 0.15
| 2048 , 0    , 1024 , 17.92 , 17.38 , no-RTM , 0.54
| 2048 , 0    , 2048 , 30.37 , 27.34 , no-RTM , 3.03   <--
|
| test-memchr, test-wmemchr, and test-rawmemchr are all passing.
|
| Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
| Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add AVX optimized string/memory functions for RTM | H.J. Lu | 2021-03-29 | 1 | -0/+27
| Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
| optimized string/memory functions with
|
|         xtest
|         jz      1f
|         vzeroall
|         ret
| 1:
|         vzeroupper
|         ret
|
| at function exit on processors with usable RTM, but without 256-bit EVEX
| instructions to avoid VZEROUPPER inside a transactionally executing RTM
| region.
* x86-64: Add memcmp family functions with 256-bit EVEX | H.J. Lu | 2021-03-29 | 1 | -1/+3
| Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
| instructions using YMM16-YMM31 registers to avoid RTM abort with usable
| AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
| exit.
* x86-64: Add memset family functions with 256-bit EVEX | H.J. Lu | 2021-03-29 | 1 | -0/+1
| Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
| with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
| abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
| function exit.
* x86-64: Add memmove family functions with 256-bit EVEX | H.J. Lu | 2021-03-29 | 1 | -0/+1
| Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
| instructions using YMM16-YMM31 registers to avoid RTM abort with usable
| AVX512VL since VZEROUPPER isn't needed at function exit.
* x86-64: Add strcpy family functions with 256-bit EVEX | H.J. Lu | 2021-03-29 | 1 | -0/+6
| Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
| instructions using YMM16-YMM31 registers to avoid RTM abort with usable
| AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
* x86-64: Add ifunc-avx2.h functions with 256-bit EVEX | H.J. Lu | 2021-03-29 | 1 | -2/+19
| Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
| select the function optimized with 256-bit EVEX instructions using
| YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
| and BMI2 since VZEROUPPER isn't needed at function exit.
|
| For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
| is set.
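The selection rules spread across these 2021-03-29 commits can be summarized in a small model. This is a hedged sketch, not the contents of any glibc selector header: `select_strcmp` is a hypothetical function, feature names are modelled as plain strings in a set, the returned labels are illustrative, the ordering of the checks is an assumption, and the single "baseline" fallback stands in for whatever lower tiers the real selector has.

```python
EVEX_FEATURES = {"AVX512VL", "AVX512BW", "BMI2"}
AVX2_FEATURES = {"AVX2", "AVX_Fast_Unaligned_Load"}


def select_strcmp(usable, prefer_avx2_strcmp=False):
    """Pick an implementation label from a set of usable feature names."""
    # 256-bit EVEX variant: uses YMM16-YMM31, so no VZEROUPPER is needed
    # at function exit and an RTM region is not aborted.
    if EVEX_FEATURES <= usable and not prefer_avx2_strcmp:
        return "evex"
    if AVX2_FEATURES <= usable:
        # With usable RTM, pick the variant that checks xtest and uses
        # VZEROALL inside a transaction.
        if "RTM" in usable:
            return "avx2_rtm"
        # Plain AVX2 only where VZEROUPPER is preferred.
        if "Prefer_No_VZEROUPPER" not in usable:
            return "avx2"
    # Assumption: everything else collapses into one baseline label here.
    return "baseline"


if __name__ == "__main__":
    print(select_strcmp(AVX2_FEATURES | EVEX_FEATURES))
    print(select_strcmp(AVX2_FEATURES | {"RTM"}))
    print(select_strcmp(set()))
```

Modelling the inputs as a set keeps each rule a one-line subset test, which makes the priority order easy to read and to change if the assumed ordering turns out to differ from the real selector.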
* x86: Install <sys/platform/x86.h> [BZ #26124] | H.J. Lu | 2020-09-11 | 1 | -4/+0
| Install <sys/platform/x86.h> so that programmers can do
|
| #if __has_include(<sys/platform/x86.h>)
| #include <sys/platform/x86.h>
| #endif
| ...
|
|   if (CPU_FEATURE_USABLE (SSE2))
| ...
|   if (CPU_FEATURE_USABLE (AVX2))
| ...
|
| <sys/platform/x86.h> exports only:
|
| enum
| {
|   COMMON_CPUID_INDEX_1 = 0,
|   COMMON_CPUID_INDEX_7,
|   COMMON_CPUID_INDEX_80000001,
|   COMMON_CPUID_INDEX_D_ECX_1,
|   COMMON_CPUID_INDEX_80000007,
|   COMMON_CPUID_INDEX_80000008,
|   COMMON_CPUID_INDEX_7_ECX_1,
|   /* Keep the following line at the end.  */
|   COMMON_CPUID_INDEX_MAX
| };
|
| struct cpuid_features
| {
|   struct cpuid_registers cpuid;
|   struct cpuid_registers usable;
| };
|
| struct cpu_features
| {
|   struct cpu_features_basic basic;
|   struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
| };
|
| /* Get a pointer to the CPU features structure.  */
| extern const struct cpu_features *__x86_get_cpu_features
|   (unsigned int max) __attribute__ ((const));
|
| Since all feature checks are done through macros, programs compiled with
| a newer <sys/platform/x86.h> are compatible with the older glibc binaries
| as long as the layout of struct cpu_features is identical. The features
| array can be expanded with backward binary compatibility for both .o and
| .so files. When COMMON_CPUID_INDEX_MAX is increased to support new
| processor features, __x86_get_cpu_features in the older glibc binaries
| returns NULL and HAS_CPU_FEATURE/CPU_FEATURE_USABLE return false on the
| new processor feature. No new symbol version is needed.
|
| Both CPU_FEATURE_USABLE and HAS_CPU_FEATURE are provided. HAS_CPU_FEATURE
| can be used to identify processor features.
|
| Note: Although GCC has __builtin_cpu_supports, it only supports a subset
| of <sys/platform/x86.h> and it is equivalent to CPU_FEATURE_USABLE. It
| doesn't support HAS_CPU_FEATURE.
* x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2 | Leonardo Sandoval | 2019-01-14 | 1 | -0/+3
| Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
| It uses vector comparison as much as possible. In general, the larger the
| source string, the greater performance gain observed, reaching speedups of
| 1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat,
| strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
| preferred and AVX unaligned load is fast.
|
| * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
|   strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
|   stpcpy-avx2 and stpncpy-avx2.
| * sysdeps/x86_64/multiarch/ifunc-impl-list.c:
|   (__libc_ifunc_impl_list): Add tests for __strcat_avx2,
|   __strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
|   and __stpncpy_avx2.
| * sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
|   ifunc-strcpy.h}: rename header for a more generic name.
| * sysdeps/x86_64/multiarch/ifunc-strcpy.h:
|   (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
|   AVX unaligned load is fast and vzeroupper is preferred.
| * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
| * sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise
| * sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise
| * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise
| * sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise
| * sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise
* x86-64: Optimize strcmp/wcscmp and strncmp/wcsncmp with AVX2 | Leonardo Sandoval | 2018-06-01 | 1 | -2/+4
| Optimize x86-64 strcmp/wcscmp and strncmp/wcsncmp with AVX2. It uses vector
| comparison as much as possible. Peak performance observed on a SkyLake
| machine: 9x, 3x, 2.5x and 5.5x for strcmp, strncmp, wcscmp and wcsncmp,
| respectively. The larger the comparison length, the more benefit using
| avx2 functions, except on the strcmp, where peak is observed at length
| == 32 bytes. Select AVX2 strcmp/wcscmp on AVX2 machines where vzeroupper
| is preferred and AVX unaligned load is fast.
|
| NB: It uses TZCNT instead of BSF since TZCNT produces the same result
| as BSF for non-zero input. TZCNT is faster than BSF and is executed
| as BSF if machine doesn't support TZCNT.
|
| * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
|   strcmp-avx2, strncmp-avx2, wcscmp-avx2, wcscmp-sse2, wcsncmp-avx2 and
|   wcsncmp-sse2.
| * sysdeps/x86_64/multiarch/ifunc-impl-list.c
|   (__libc_ifunc_impl_list): Add tests for __strcmp_avx2,
|   __strncmp_avx2, __wcscmp_avx2, __wcsncmp_avx2, __wcscmp_sse2
|   and __wcsncmp_sse2.
| * sysdeps/x86_64/multiarch/strcmp.c (OPTIMIZE (avx2)):
|   (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
|   AVX unaligned load is fast and vzeroupper is preferred.
| * sysdeps/x86_64/multiarch/strncmp.c: Likewise.
| * sysdeps/x86_64/multiarch/strcmp-avx2.S: New file.
| * sysdeps/x86_64/multiarch/strncmp-avx2.S: Likewise.
| * sysdeps/x86_64/multiarch/wcscmp-avx2.S: Likewise.
| * sysdeps/x86_64/multiarch/wcscmp-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/wcscmp.c: Likewise.
| * sysdeps/x86_64/multiarch/wcsncmp-avx2.S: Likewise.
| * sysdeps/x86_64/multiarch/wcsncmp-sse2.c: Likewise.
| * sysdeps/x86_64/multiarch/wcsncmp.c: Likewise.
| * sysdeps/x86_64/wcscmp.S (__wcscmp): Add alias only if __wcscmp
|   is undefined.
* x86-64: Implement strcmp family IFUNC selectors in C | H.J. Lu | 2017-06-21 | 1 | -4/+10
| Implement strcmp family IFUNC selectors in C.
|
| All internal calls within libc.so can use IFUNC on x86-64 since unlike
| x86, x86-64 supports PC-relative addressing to access the GOT entry so
| that it can call via PLT without using an extra register. For libc.a,
| we can't use IFUNC for functions which are called before IFUNC has been
| initialized. Using IFUNC internally reduces the icache footprint since
| libc.so and other code in the process use the same implementations.
| This patch uses IFUNC for strcmp family functions within libc.
|
| * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
|   strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
|   strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
|   strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
| * sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
| * sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
| * sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
| * sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
| * sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
| * sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
| * sysdeps/x86_64/multiarch/strcmp.c: Likewise.
| * sysdeps/x86_64/multiarch/strncase.c: Likewise.
| * sysdeps/x86_64/multiarch/strncase_l-avx.S: Likewise.
| * sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
| * sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
| * sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
| * sysdeps/x86_64/multiarch/strncmp.c: Likewise.
| * sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
| * sysdeps/x86_64/multiarch/strcmp.S: Likewise.
| * sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
| * sysdeps/x86_64/multiarch/strncmp.S: Likewise.
| * sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
|   (STRCMP_SSE42): New. Defined to __strcmp_sse42 if not defined.
|   [USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
|   "locale-defines.h".
|   (UPDATE_STRNCMP_COUNTER): New.
|   (SECTION): Likewise.
|   (GLABEL): Likewise.
|   (LABEL): Likewise.
| * sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
|   for libc.a.
* x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in C | H.J. Lu | 2017-06-15 | 1 | -0/+1
| Implement strcspn/strpbrk/strspn IFUNC selectors in C.
|
| All internal calls within libc.so can use IFUNC on x86-64 since unlike
| x86, x86-64 supports PC-relative addressing to access the GOT entry so
| that it can call via PLT without using an extra register. For libc.a,
| we can't use IFUNC for functions which are called before IFUNC has been
| initialized. Using IFUNC internally reduces the icache footprint since
| libc.so and other code in the process use the same implementations.
| This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.
|
| * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
|   strcspn-sse2, strpbrk-sse2 and strspn-sse2.
| * sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed. Check
|   USE_AS_STRPBRK instead of STRPBRK_P.
| * sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
| * sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
| * sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strcspn.c: Likewise.
| * sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
| * sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
| * sysdeps/x86_64/multiarch/strspn.c: Likewise.
| * sysdeps/x86_64/multiarch/strcspn.S: Removed.
| * sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
| * sysdeps/x86_64/multiarch/strspn.S: Likewise.
| * sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
|   and "#endif".
* x86-64: Implement strcat family IFUNC selectors in C (H.J. Lu, 2017-06-15, 1 file, -0/+1)
  Implement strcat family IFUNC selectors in C.

  All internal calls within libc.so can use IFUNC on x86-64 since, unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Using IFUNC internally reduces the icache footprint since libc.so and other code in the process use the same implementations. This patch uses IFUNC for strcat family functions within libc.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcat-sse2.
  * sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
  * sysdeps/x86_64/multiarch/strcat.c: Likewise.
  * sysdeps/x86_64/multiarch/strncat.c: Likewise.
  * sysdeps/x86_64/multiarch/strcat.S: Removed.
  * sysdeps/x86_64/multiarch/strncat.S: Likewise.
* x86-64: Implement memcmp family IFUNC selectors in C (H.J. Lu, 2017-06-15, 1 file, -0/+1)
  Implement memcmp family IFUNC selectors in C.

  All internal calls within libc.so can use IFUNC on x86-64 since, unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Using IFUNC internally reduces the icache footprint since libc.so and other code in the process use the same implementations. This patch uses IFUNC for memcmp family functions within libc.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcmp-sse2.
  * sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
  * sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
  * sysdeps/x86_64/multiarch/memcmp.c: Likewise.
  * sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
  * sysdeps/x86_64/multiarch/memcmp.S: Removed.
  * sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
* x86-64: Implement memset family IFUNC selectors in C (H.J. Lu, 2017-06-15, 1 file, -1/+2)
  Implement memset family IFUNC selectors in C.

  All internal calls within libc.so can use IFUNC on x86-64 since, unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Using IFUNC internally reduces the icache footprint since libc.so and other code in the process use the same implementations. This patch uses IFUNC for memset functions within libc.

  2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>
              Erich Elsen  <eriche@google.com>

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, and memset_chk-nonshared.
  * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add test for __memset_chk_erms. Update comments.
  * sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
  * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
  * sysdeps/x86_64/multiarch/memset.c: Likewise.
  * sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
  * sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
  * sysdeps/x86_64/multiarch/memset.S: Removed.
  * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
  * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset_chk_erms): New function.
* x86-64: Implement memmove family IFUNC selectors in C (H.J. Lu, 2017-06-14, 1 file, -1/+4)
  Implement memmove family IFUNC selectors in C.

  All internal calls within libc.so can use IFUNC on x86-64 since, unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Using IFUNC internally reduces the icache footprint since libc.so and other code in the process use the same implementations. This patch uses IFUNC for memmove family functions within libc.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memcpy_chk-nonshared, mempcpy_chk-nonshared and memmove_chk-nonshared.
  * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __memmove_chk_erms, __memcpy_chk_erms and __mempcpy_chk_erms. Update comments.
  * sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
  * sysdeps/x86_64/multiarch/memcpy.c: Likewise.
  * sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
  * sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
  * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
  * sysdeps/x86_64/multiarch/memmove.c: Likewise.
  * sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
  * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
  * sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
  * sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
  * sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
  * sysdeps/x86_64/multiarch/memcpy.S: Removed.
  * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
  * sysdeps/x86_64/multiarch/memmove.S: Likewise.
  * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
  * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
  * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
  * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (__mempcpy_chk_erms): New function.
  (__memmove_chk_erms): Likewise.
  (__memcpy_chk_erms): New alias.
* x86-64: Implement strcpy family IFUNC selectors in C (H.J. Lu, 2017-06-12, 1 file, -0/+1)
  Implement strcpy family IFUNC selectors in C.

  All internal calls within libc.so can use IFUNC on x86-64 since, unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Using IFUNC internally reduces the icache footprint since libc.so and other code in the process use the same implementations. This patch uses IFUNC for strcpy family functions within libc.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcpy-sse2 and stpcpy-sse2.
  * sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
  * sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
  * sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
  * sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
  * sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
  * sysdeps/x86_64/multiarch/strcpy.c: Likewise.
  * sysdeps/x86_64/multiarch/strncpy.c: Likewise.
  * sysdeps/x86_64/multiarch/stpcpy.S: Removed.
  * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
  * sysdeps/x86_64/multiarch/strcpy.S: Likewise.
  * sysdeps/x86_64/multiarch/strncpy.S: Likewise.
  * sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
  (libc_hidden_def): Always defined as empty.
  * sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def): Always defined as empty.
* x86-64: Optimize strrchr/wcsrchr with AVX2 (H.J. Lu, 2017-06-09, 1 file, -0/+2)
  Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector instructions. It is as fast as the SSE2 version for small data sizes and up to 1X faster for large data sizes on Haswell. Select the AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.

  * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
  * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strrchr_avx2, __strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
  * sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
  * sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
  * sysdeps/x86_64/multiarch/strrchr.c: Likewise.
  * sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
  * sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
  * sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.