about summary refs log tree commit diff
path: root/sysdeps/x86_64/multiarch/ifunc-impl-list.c
Commit message (Collapse)AuthorAgeFilesLines
* x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.SNoah Goldstein2021-04-191-12/+28
| | | | | | | | No bug. This commit adds optimized cased for less_vec memset case that uses the avx512vl/avx512bw mask store avoiding the excessive branches. test-memset and test-wmemset are passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for strchr-avx2.SH.J. Lu2021-04-191-3/+9
| | | | | | | | | | | | | | | | | Since strchr-avx2.S updated by commit 1f745ecc2109890886b161d4791e1406fdfc29b8 Author: noah <goldstein.w.n@gmail.com> Date: Wed Feb 3 00:38:59 2021 -0500 x86-64: Refactor and improve performance of strchr-avx2.S uses sarx: c4 e2 72 f7 c0 sarx %ecx,%eax,%eax for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and ifunc-avx2.h.
* x86-64: Require BMI2 for __strlen_evex and __strnlen_evexH.J. Lu2021-04-191-2/+4
| | | | | | | | | | | | | | | | | Since __strlen_evex and __strnlen_evex added by commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 5 06:24:52 2021 -0800 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX use sarx: c4 e2 6a f7 c0 sarx %edx,%eax,%eax require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c. ifunc-avx2.h already requires BMI2 for EVEX implementation.
* x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functionsH.J. Lu2021-03-291-12/+12
| | | | | | Update ifunc-memmove.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit.
* x86-64: Use ZMM16-ZMM31 in AVX512 memset family functionsH.J. Lu2021-03-291-5/+9
| | | | | | | Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
* x86-64: Add AVX optimized string/memory functions for RTMH.J. Lu2021-03-291-0/+170
| | | | | | | | | | | | | | | | | Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX optimized string/memory functions with xtest jz 1f vzeroall ret 1: vzeroupper ret at function exit on processors with usable RTM, but without 256-bit EVEX instructions to avoid VZEROUPPER inside a transactionally executing RTM region.
* x86-64: Add memcmp family functions with 256-bit EVEXH.J. Lu2021-03-291-0/+10
| | | | | | | Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function exit.
* x86-64: Add memset family functions with 256-bit EVEXH.J. Lu2021-03-291-0/+22
| | | | | | | Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
* x86-64: Add memmove family functions with 256-bit EVEXH.J. Lu2021-03-291-0/+36
| | | | | | Update ifunc-memmove.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit.
* x86-64: Add strcpy family functions with 256-bit EVEXH.J. Lu2021-03-291-0/+24
| | | | | | Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
* x86-64: Add ifunc-avx2.h functions with 256-bit EVEXH.J. Lu2021-03-291-0/+81
| | | | | | | | | | Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW and BMI2 since VZEROUPPER isn't needed at function exit. For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP is set.
* Update copyright dates with scripts/update-copyrightsPaul Eggert2021-01-021-1/+1
| | | | | | | | | | | | | | | | I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO. I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah: remote: *** pre-commit check failed ... remote: *** error: lines with trailing whitespace found remote: error: hook declined to update refs/heads/master
* x86: Support usable check for all CPU featuresH.J. Lu2020-07-131-114/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support usable check for all CPU features with the following changes: 1. Change struct cpu_features to struct cpuid_features { struct cpuid_registers cpuid; struct cpuid_registers usable; }; struct cpu_features { struct cpu_features_basic basic; struct cpuid_features features[COMMON_CPUID_INDEX_MAX]; unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX]; ... }; so that there is a usable bit for each cpuid bit. 2. After the cpuid bits have been initialized, copy the known bits to the usable bits. EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for CPU feature detection. 3. Clear the usable bits which require OS support. 4. If the feature is supported by OS, copy its cpuid bit to its usable bit. 5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE and CPU_FEATURE_USABLE_P to check if a feature is usable. 6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13. 7. Unset MPX feature since it has been deprecated. The results are 1. If the feature is known and doesn't requre OS support, its usable bit is copied from the cpuid bit. 2. Otherwise, its usable bit is copied from the cpuid bit only if the feature is known to supported by OS. 3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the feature can be used. 4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if CPU supports the feature.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2020-01-011-1/+1
|
* Prefer https to http for gnu.org and fsf.org URLsPaul Eggert2019-09-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also, change sources.redhat.com to sourceware.org. This patch was automatically generated by running the following shell script, which uses GNU sed, and which avoids modifying files imported from upstream: sed -ri ' s,(http|ftp)(://(.*\.)?(gnu|fsf|sourceware)\.org($|[^.]|\.[^a-z])),https\2,g s,(http|ftp)(://(.*\.)?)sources\.redhat\.com($|[^.]|\.[^a-z]),https\2sourceware.org\4,g ' \ $(find $(git ls-files) -prune -type f \ ! -name '*.po' \ ! -name 'ChangeLog*' \ ! -path COPYING ! -path COPYING.LIB \ ! -path manual/fdl-1.3.texi ! -path manual/lgpl-2.1.texi \ ! -path manual/texinfo.tex ! -path scripts/config.guess \ ! -path scripts/config.sub ! -path scripts/install-sh \ ! -path scripts/mkinstalldirs ! -path scripts/move-if-change \ ! -path INSTALL ! -path locale/programs/charmap-kw.h \ ! -path po/libc.pot ! -path sysdeps/gnu/errlist.c \ ! '(' -name configure \ -execdir test -f configure.ac -o -f configure.in ';' ')' \ ! '(' -name preconfigure \ -execdir test -f preconfigure.ac ';' ')' \ -print) and then by running 'make dist-prepare' to regenerate files built from the altered files, and then executing the following to cleanup: chmod a+x sysdeps/unix/sysv/linux/riscv/configure # Omit irrelevant whitespace and comment-only changes, # perhaps from a slightly-different Autoconf version. git checkout -f \ sysdeps/csky/configure \ sysdeps/hppa/configure \ sysdeps/riscv/configure \ sysdeps/unix/sysv/linux/csky/configure # Omit changes that caused a pre-commit check to fail like this: # remote: *** error: sysdeps/powerpc/powerpc64/ppc-mcount.S: trailing lines git checkout -f \ sysdeps/powerpc/powerpc64/ppc-mcount.S \ sysdeps/unix/sysv/linux/s390/s390-64/syscall.S # Omit change that caused a pre-commit check to fail like this: # remote: *** error: sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S: last line does not end in newline git checkout -f sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S
* x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2Leonardo Sandoval2019-01-141-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2. It uses vector comparison as much as possible. In general, the larger the source string, the greater performance gain observed, reaching speedups of 1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2, stpcpy-avx2 and stpncpy-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: (__libc_ifunc_impl_list): Add tests for __strcat_avx2, __strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2 and __stpncpy_avx2. * sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h => ifunc-strcpy.h}: rename header for a more generic name. * sysdeps/x86_64/multiarch/ifunc-strcpy.h: (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred. * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file * sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise * sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise * sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise * sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise
* Update copyright dates with scripts/update-copyrights.Joseph Myers2019-01-011-1/+1
| | | | | | | * All files with FSF copyright notices: Update copyright dates using scripts/update-copyrights. * locale/programs/charmap-kw.h: Regenerated. * locale/programs/locfile-kw.h: Likewise.
* x86-64: Optimize strcmp/wcscmp and strncmp/wcsncmp with AVX2Leonardo Sandoval2018-06-011-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize x86-64 strcmp/wcscmp and strncmp/wcsncmp with AVX2. It uses vector comparison as much as possible. Peak performance observed on a SkyLake machine: 9x, 3x, 2.5x and 5.5x for strcmp, strncmp, wcscmp and wcsncmp, respectively. The larger the comparison length, the more benefit using avx2 functions, except on the strcmp, where peak is observed at length == 32 bytes. Select AVX2 strcmp/wcscmp on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcmp-avx2, strncmp-avx2, wcscmp-avx2, wcscmp-sse2, wcsncmp-avx2 and wcsncmp-sse2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strcmp_avx2, __strncmp_avx2, __wcscmp_avx2, __wcsncmp_avx2, __wcscmp_sse2 and __wcsncmp_sse2. * sysdeps/x86_64/multiarch/strcmp.c (OPTIMIZE (avx2)): (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred. * sysdeps/x86_64/multiarch/strncmp.c: Likewise. * sysdeps/x86_64/multiarch/strcmp-avx2.S: New file. * sysdeps/x86_64/multiarch/strncmp-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcscmp-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcscmp-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcscmp.c: Likewise. * sysdeps/x86_64/multiarch/wcsncmp-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcsncmp-sse2.c: Likewise. * sysdeps/x86_64/multiarch/wcsncmp.c: Likewise. * sysdeps/x86_64/wcscmp.S (__wcscmp): Add alias only if __wcscmp is undefined.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2018-01-011-1/+1
| | | | | | | * All files with FSF copyright notices: Update copyright dates using scripts/update-copyrights. * locale/programs/charmap-kw.h: Regenerated. * locale/programs/locfile-kw.h: Likewise.
* x86-64: Use IFUNC memcpy and mempcpy in libc.aH.J. Lu2017-08-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | | Since apply_irel is called before memcpy and mempcpy are called, we can use IFUNC memcpy and mempcpy in libc.a. * sysdeps/x86_64/memmove.S (MEMCPY_SYMBOL): Don't check SHARED. (MEMPCPY_SYMBOL): Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test memcpy and mempcpy in libc.a. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Also include in libc.a. * sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.c: Also include in libc.a. (__hidden_ver1): Don't use in libc.a. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (__mempcpy): Don't create a weak alias in libc.a. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Support libc.a. * sysdeps/x86_64/multiarch/mempcpy.c: Also include in libc.a. (__hidden_ver1): Don't use in libc.a.
* x86-64: Test memmove_chk and memset_chk only in libc.so [BZ #21741]H.J. Lu2017-07-101-0/+4
| | | | | | | | | | Since there are no multiarch versions of memmove_chk and memset_chk, test multiarch versions of memmove_chk and memset_chk only in libc.so. [BZ #21741] * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test memmove_chk and memset_chk only in libc.so.
* x86-64: Update comments in ifunc-impl-list.cH.J. Lu2017-07-091-16/+16
| | | | | | | All x86-64 IFUNC selectors are written in C now. Update comments to reflect it. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update comments.
* x86-64: Implement memcmp family IFUNC selectors in CH.J. Lu2017-06-151-2/+2
| | | | | | | | | | | | | | | | | | | | | Implement memcmp family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memcmp family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcmp-sse2. * sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file. * sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memcmp.c: Likewise. * sysdeps/x86_64/multiarch/wmemcmp.c: Likewise. * sysdeps/x86_64/multiarch/memcmp.S: Removed. * sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
* x86-64: Implement memset family IFUNC selectors in CH.J. Lu2017-06-151-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement memset family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memset functions within libc. 2017-06-07 H.J. Lu <hongjiu.lu@intel.com> Erich Elsen <eriche@google.com> * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, and memset_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add test for __memset_chk_erms. Update comments. * sysdeps/x86_64/multiarch/ifunc-memset.h: New file. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.c: Likewise. * sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.c: Likewise. * sysdeps/x86_64/multiarch/memset.S: Removed. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset_chk_erms): New function.
* x86-64: Implement memmove family IFUNC selectors in CH.J. Lu2017-06-141-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement memmove family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memmove family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memcpy_chk-nonshared, mempcpy_chk-nonshared and memmove_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __memmove_chk_erms, __memcpy_chk_erms and __mempcpy_chk_erms. Update comments. * sysdeps/x86_64/multiarch/ifunc-memmove.h: New file. * sysdeps/x86_64/multiarch/memcpy.c: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Removed. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (__mempcpy_chk_erms): New function. (__memmove_chk_erms): Likewise. (__memcpy_chk_erms): New alias.
* x86-64: Correct comments in ifunc-impl-list.cH.J. Lu2017-06-091-6/+6
| | | | * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Correct comments.
* x86-64: Optimize strrchr/wcsrchr with AVX2H.J. Lu2017-06-091-0/+14
| | | | | | | | | | | | | | | | | | | | Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector instructions. It is as fast as SSE2 version for small data sizes and up to 1X faster for large data sizes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strrchr_avx2, __strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2. * sysdeps/x86_64/multiarch/strrchr-avx2.S: New file. * sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strrchr.c: Likewise. * sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
* x86-64: Optimize memrchr with AVX2H.J. Lu2017-06-091-0/+7
| | | | | | | | | | | | | | | | | Optimize memrchr with AVX2 to search 32 bytes with a single vector compare instruction. It is as fast as SSE2 memrchr for small data sizes and up to 1X faster for large data sizes on Haswell. Select AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memrchr-sse2 and memrchr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and __memrchr_sse2. * sysdeps/x86_64/multiarch/memrchr-avx2.S: New file. * sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memrchr.c: Likewise.
* x86-64: Optimize strchr/strchrnul/wcschr with AVX2H.J. Lu2017-06-091-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector instructions. It is as fast as SSE2 versions for size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2, wcschr-sse2 and wcschr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strchr_avx2, __strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and __wcschr_sse2. * sysdeps/x86_64/multiarch/strchr-avx2.S: New file. * sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strchr.c: Likewise. * sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise. * sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strchrnul.c: Likewise. * sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcschr.c: Likewise. * sysdeps/x86_64/multiarch/strchr.S: Removed.
* x86-64: Optimize strlen/strnlen/wcslen/wcsnlen with AVX2H.J. Lu2017-06-091-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with a single vector compare instruction. It is as fast as SSE2 versions for size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2, wcslen-sse2, wcslen-avx2 and wcsnlen-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strlen_avx2, __strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2, __wcslen_sse2 and __wcsnlen_avx2. * sysdeps/x86_64/multiarch/strlen-avx2.S: New file. * sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strlen.c: Likewise. * sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strnlen.c: Likewise. * sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcslen.c: Likewise. * sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New. (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.
* x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2H.J. Lu2017-06-091-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SSE2 memchr is extended to support wmemchr. AVX2 memchr/rawmemchr/wmemchr are added to search 32 bytes with a single vector compare instruction. AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr for small sizes and up to 1.5X faster for larger sizes on Haswell and Skylake. Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/memchr.S (MEMCHR): New. Depending on if USE_AS_WMEMCHR is defined. (PCMPEQ): Likewise. (memchr): Renamed to ... (MEMCHR): This. Support wmemchr if USE_AS_WMEMCHR is defined. Replace pcmpeqb with PCMPEQ. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2, wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c. * sysdeps/x86_64/multiarch/ifunc-avx2.h: New file. * sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memchr.c: Likewise. * sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/rawmemchr.c: Likewise. * sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wmemchr.c: Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2, __rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and __wmemchr_sse2.
* x86-64: Rename wmemset.h to ifunc-wmemset.hH.J. Lu2017-06-071-2/+2
| | | | | | | | | | No code changes. * sysdeps/x86_64/multiarch/wmemset.c: Include ifunc-wmemset.h instead of wmemset.h. * sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise. * sysdeps/x86_64/multiarch/wmemset.h: Renamed to ... * sysdeps/x86_64/multiarch/ifunc-wmemset.h: This.
* x86-64: Move wcsnlen.S to multiarch/wcsnlen-sse4_1.SH.J. Lu2017-06-061-0/+7
| | | | | | | | | | | | | | | | Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S to multiarch/wcsnlen-sse4_1.S. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add wcsnlen-sse4_1 and wcsnlen-c. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and __wcsnlen_sse2. * sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file. * sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise. * sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise. * sysdeps/x86_64/multiarch/wcsnlen.c: Likewise. * sysdeps/x86_64/wcsnlen.S: Removed.
* x86-64: Optimize memcmp/wmemcmp with AVX2 and MOVBEH.J. Lu2017-06-051-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize x86-64 memcmp/wmemcmp with AVX2. It uses vector compare as much as possible. It is as fast as SSE4 memcmp for size <= 16 bytes and up to 2X faster for size > 16 bytes on Haswell and Skylake. Select AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. Key features: 1. For size from 2 to 7 bytes, load as big endian with movbe and bswap to avoid branches. 2. Use overlapping compare to avoid branch. 3. Use vector compare when size >= 4 bytes for memcmp or size >= 8 bytes for wmemcmp. 4. If size is 8 * VEC_SIZE or less, unroll the loop. 5. Compare 4 * VEC_SIZE at a time with the aligned first memory area. 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. * sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcmp-avx2 and wmemcmp-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2. * sysdeps/x86_64/multiarch/memcmp-avx2.S: New file. * sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise. * sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred. * sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred.
* x86-64: Optimize wmemset with SSE2/AVX2/AVX512H.J. Lu2017-06-051-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The difference between memset and wmemset is byte vs int. Add stubs to SSE2/AVX2/AVX512 memset for wmemset with updated constant and size: SSE2 wmemset: shl $0x2,%rdx movd %esi,%xmm0 mov %rdi,%rax pshufd $0x0,%xmm0,%xmm0 jmp entry_from_wmemset SSE2 memset: movd %esi,%xmm0 mov %rdi,%rax punpcklbw %xmm0,%xmm0 punpcklwd %xmm0,%xmm0 pshufd $0x0,%xmm0,%xmm0 entry_from_wmemset: Since the ERMS versions of wmemset requires "rep stosl" instead of "rep stosb", only the vector store stubs of SSE2/AVX2/AVX512 wmemset are added. The SSE2 wmemset is about 3X faster and the AVX2 wmemset is about 6X faster on Haswell. * include/wchar.h (__wmemset_chk): New. * sysdeps/x86_64/memset.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_CHK_SYMBOL): Likewise. (WMEMSET_SYMBOL): Likewise. (__wmemset): Add hidden definition. (wmemset): Add weak hidden definition. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add wmemset_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __wmemset_sse2_unaligned, __wmemset_avx2_unaligned, __wmemset_avx512_unaligned, __wmemset_chk_sse2_unaligned, __wmemset_chk_avx2_unaligned and __wmemset_chk_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ... (MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_SYMBOL): Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ... (MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_SYMBOL): Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Updated. (WMEMSET_CHK_SYMBOL): New. (WMEMSET_CHK_SYMBOL (__wmemset_chk, unaligned)): Likewise. (WMEMSET_SYMBOL (__wmemset, unaligned)): Likewise. * sysdeps/x86_64/multiarch/memset.S (WMEMSET_SYMBOL): New. (libc_hidden_builtin_def): Also define __GI_wmemset and __GI___wmemset. (weak_alias): New. * sysdeps/x86_64/multiarch/wmemset.c: New file. * sysdeps/x86_64/multiarch/wmemset.h: Likewise. * sysdeps/x86_64/multiarch/wmemset_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise. * sysdeps/x86_64/wmemset.c: Likewise. * sysdeps/x86_64/wmemset_chk.c: Likewise.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2017-01-011-1/+1
|
* Require binutils 2.24 to build x86-64 glibc [BZ #20139]H.J. Lu2016-07-011-16/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used to save the first 8 vector registers, which only saves the lower 256 bits of vector register, for lazy binding. When it is called on AVX512 platform, the upper 256 bits of ZMM registers are clobbered. Parameters passed in ZMM registers will be wrong when the function is called the first time. This patch requires binutils 2.24, whose assembler can store and load ZMM registers, to build x86-64 glibc. Since mathvec library needs assembler support for AVX512DQ, we disable mathvec if assembler doesn't support AVX512DQ. [BZ #20139] * config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ... (HAVE_AVX512DQ_ASM_SUPPORT): This. * sysdeps/x86_64/configure.ac: Require assembler from binutils 2.24 or above. (HAVE_AVX512_ASM_SUPPORT): Removed. (HAVE_AVX512DQ_ASM_SUPPORT): New. * sysdeps/x86_64/configure: Regenerated. * sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT check unconditional. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT. * sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S: Likewise.
* X86-64: Remove previous default/SSE2/AVX2 memcpy/memmoveH.J. Lu2016-06-081-46/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.
* X86-64: Remove the previous SSE2/AVX2 memsetsH.J. Lu2016-06-081-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
* Add x86-64 memset with unaligned store and rep stosbH.J. Lu2016-03-311-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 memset with unaligned store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* Add x86-64 memmove with unaligned load/store and rep movsbH.J. Lu2016-03-311-0/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap bewteen source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same codes when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap bewteen source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by fallbacking to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likwise.
* Fixed build with assembler w/o AVX-512 support.Andrew Senkevich2016-01-191-0/+12
| | | | | * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Fixed build with assembler not supporting AVX-512.
* Added memcpy/memmove family optimized with AVX512 for KNL hardware.Andrew Senkevich2016-01-161-2/+20
| | | | | | | | | | | | | | | | | | | | Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk, mempcpy_chk, memmove_chk. It shows average improvement more than 30% over AVX versions on KNL hardware (performance results in the thread <https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>). * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2016-01-041-1/+1
|
* Added memset optimized with AVX512 for KNL hardware.Andrew Senkevich2015-12-191-2/+15
| | | | | | | | | | | | | | | It shows improvement up to 28% over AVX2 memset (performance results attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>). * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests. * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER, index_Prefer_No_VZEROUPPER): New. * sysdeps/x86/cpu-features.c (init_cpu_features): Set the Prefer_No_VZEROUPPER for Knights Landing.
* Remove -mavx2 configure tests.Joseph Myers2015-10-281-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | There are configure tests for the -mavx2 compiler option. AVX2 support was added in GCC 4.7, so these tests are now obsolete; this patch removes them. Tested for x86_64 and x86 (testsuite, and that installed stripped shared libraries are unchanged by the patch). * sysdeps/i386/configure.ac (libc_cv_cc_avx2): Remove configure test. * sysdeps/i386/configure: Regenerated. * sysdeps/x86_64/configure.ac (libc_cv_cc_avx2): Remove configure test. * sysdeps/x86_64/configure: Regenerated. * config.h.in (HAVE_AVX2_SUPPORT): Remove #undef. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-avx2 unconditionally instead of conditionally on [$(config-cflags-avx2) = yes]. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list) [HAVE_AVX2_SUPPORT]: Make code unconditional. * sysdeps/x86_64/multiarch/memset.S [HAVE_AVX2_SUPPORT]: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S [IS_IN (libc) && SHARED && HAVE_AVX2_SUPPORT]: Change conditional to [IS_IN (libc) && SHARED].
* Remove the unused IFUNC filesH.J. Lu2015-08-201-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sysdeps/i386/i686/multiarch/strcasestr-c.c became unused after commit 1818483b15d22016b0eae41d37ee91cc87b37510 Author: Andreas Schwab <schwab@suse.de> Date: Wed Dec 18 11:53:27 2013 +1000 Remove use of SSE4.2 functions for strstr on i686 which contains -sysdep_routines += strcspn-c strpbrk-c strspn-c strstr-c strcasestr-c +sysdep_routines += strcspn-c strpbrk-c strspn-c sysdeps/x86_64/multiarch/strcasestr.c became useless after t 584b18eb4df61ccd447db2dfe8c8a7901f8c8598 Author: Ondřej Bílka <neleai@seznam.cz> Date: Sat Dec 14 19:33:56 2013 +0100 Add strstr with unaligned loads. Fixes bug 12100. which changes sysdeps/x86_64/multiarch/strcasestr.c to libc_ifunc (__strcasestr, __strcasestr_sse2); This patch removes these file. * i386/i686/multiarch/strcasestr-c.c: Removed. * x86_64/multiarch/strcasestr.c: Likewise. * x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove strcasestr.
* Update x86_64 multiarch functions for <cpu-features.h>H.J. Lu2015-08-131-51/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates x86_64 multiarch functions to use the newly defined HAS_CPU_FEATURE, HAS_ARCH_FEATURE and LOAD_RTLD_GLOBAL_RO_RDX from <cpu-features.h>. * sysdeps/x86_64/fpu/multiarch/e_asin.c: Replace HAS_XXX with HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX). * sysdeps/x86_64/fpu/multiarch/e_atan2.c: Likewise. * sysdeps/x86_64/fpu/multiarch/e_exp.c: Likewise. * sysdeps/x86_64/fpu/multiarch/e_log.c: Likewise. * sysdeps/x86_64/fpu/multiarch/e_pow.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_atan.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_fma.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_fmaf.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_sin.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_tan.c: Likewise. * sysdeps/x86_64/fpu/multiarch/s_ceil.S: Use LOAD_RTLD_GLOBAL_RO_RDX and HAS_CPU_FEATURE (SSE4_1). * sysdeps/x86_64/fpu/multiarch/s_ceilf.S: Likewise. * sysdeps/x86_64/fpu/multiarch/s_floor.S: Likewise. * sysdeps/x86_64/fpu/multiarch/s_floorf.S: Likewise. * sysdeps/x86_64/fpu/multiarch/s_nearbyint.S : Likewise. * sysdeps/x86_64/fpu/multiarch/s_nearbyintf.S: Likewise. * sysdeps/x86_64/fpu/multiarch/s_rintf.S: Likewise. * sysdeps/x86_64/fpu/multiarch/s_rintf.S : Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise. * sysdeps/x86_64/multiarch/sched_cpucount.c: Likewise. * sysdeps/x86_64/multiarch/strstr.c: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/test-multiarch.c: Likewise. * sysdeps/x86_64/multiarch/memcmp.S: Remove __init_cpu_features call. Add LOAD_RTLD_GLOBAL_RO_RDX. Replace HAS_XXX with HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX). * sysdeps/x86_64/multiarch/memcpy.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/multiarch/strcat.S: Likewise. * sysdeps/x86_64/multiarch/strchr.S: Likewise. * sysdeps/x86_64/multiarch/strcmp.S: Likewise. * sysdeps/x86_64/multiarch/strcpy.S: Likewise. * sysdeps/x86_64/multiarch/strcspn.S: Likewise. * sysdeps/x86_64/multiarch/strspn.S: Likewise. * sysdeps/x86_64/multiarch/wcscpy.S: Likewise. * sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2015-01-021-1/+1
|
* Improve 64bit memcpy performance for Haswell CPU with AVX instructionLing Ma2014-07-301-0/+12
| | | | | | | | | In this patch we take advantage of HSW memory bandwidth, manage to reduce miss branch prediction by avoiding using branch instructions and force destination to be aligned with avx instruction. The CPU2006 403.gcc benchmark indicates this patch improves performance from 2% to 10%.