about summary refs log tree commit diff
path: root/sysdeps/x86_64/multiarch
Commit message (Collapse)AuthorAgeFilesLines
* x86-64: Optimize memcmp-avx2-movbe.S for short differenceH.J. Lu2017-06-271-56/+62
| | | | | | | | | | | | | Check the first 32 bytes before checking size when size >= 32 bytes to avoid unnecessary branch if the difference is in the first 32 bytes. Replace vpmovmskb/subl/jnz with vptest/jnc. On Haswell, the new version is as fast as the previous one. On Skylake, the new version is a little bit faster. * sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check the first 32 bytes before checking size when size >= 32 bytes. Replace vpmovmskb/subl/jnz with vptest/jnc.
* x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.SH.J. Lu2017-06-231-4/+2
| | | | | | | | | | | | | | | | | Turn movzbl -1(%rdi, %rdx), %edi movzbl -1(%rsi, %rdx), %esi orl %edi, %eax orl %esi, %ecx into movb -1(%rdi, %rdx), %al movb -1(%rsi, %rdx), %cl * sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3): Replace movzbl and orl with movb.
* x86-64: Fix comment typo in memcmp-avx2-movbe.SFlorian Weimer2017-06-231-1/+1
|
* x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]Florian Weimer2017-06-231-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | This code: L(between_2_3): /* Load as big endian with overlapping loads and bswap to avoid branches. */ movzwl -2(%rdi, %rdx), %eax movzwl -2(%rsi, %rdx), %ecx shll $16, %eax shll $16, %ecx movzwl (%rdi), %edi movzwl (%rsi), %esi orl %edi, %eax orl %esi, %ecx bswap %eax bswap %ecx subl %ecx, %eax ret needs a saturating subtract because the full register is used. With this commit, only the lower 24 bits of the register are used, so a regular subtraction suffices. The test case change adds coverage for these kinds of bugs.
* x86-64: Implement strcmp family IFUNC selectors in CH.J. Lu2017-06-2124-240/+605
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement strcmp family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for strcmp family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2, strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx, strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx. * sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file. * sysdeps/x86_64/multiarch/strcasecmp.c: Likewise. * sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise. * sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise. * sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise. * sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise. * sysdeps/x86_64/multiarch/strcmp.c: Likewise. * sysdeps/x86_64/multiarch/strncase.c: Likewise. * sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise. * sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise. * sysdeps/x86_64/multiarch/strncase_l.c: Likewise. * sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise. * sysdeps/x86_64/multiarch/strncmp.c: Likewise. * sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed. * sysdeps/x86_64/multiarch/strcmp.S: Likewise. * sysdeps/x86_64/multiarch/strncase_l.S: Likewise. * sysdeps/x86_64/multiarch/strncmp.S: Likewise. * sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>. (STRCMP_SSE42): New. Defined to __strcmp_sse42 if not defined. [USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include "locale-defines.h". (UPDATE_STRNCMP_COUNTER): New. (SECTION): Likewise. (GLABEL): Likewise. (LABEL): Likewise. * sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable for libc.a.
* Fix fallout from bits/string.h removal.Zack Weinberg2017-06-202-0/+4
| | | | | | | | | | | | | | | Remove one more string inline that was defined directly in string.h; in the absence of the rest of the inlines, it broke the build. Like other ifunc shims for these functions, x86_64/multiarch/{mem,st}pcpy.c need to define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT. * string/string.h (__mempcpy_inline): Delete. * sysdeps/x86_64/multiarch/mempcpy.c * sysdeps/x86_64/multiarch/stpcpy.c: Define NO_MEMPCPY_STPCPY_REDIRECT and __NO_STRING_INLINES before including string.h.
* Remove bits/string.h.Zack Weinberg2017-06-201-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These machine-dependent inline string functions have never been on by default, and even if they were a good idea at the time they were introduced, they haven't really been touched in ten to fifteen years and probably aren't a good idea on current-gen processors. Current thinking is that this class of optimization is best left to the compiler. * bits/string.h, string/bits/string.h * sysdeps/aarch64/bits/string.h * sysdeps/m68k/m680x0/m68020/bits/string.h * sysdeps/s390/bits/string.h, sysdeps/sparc/bits/string.h * sysdeps/x86/bits/string.h: Delete file. * string/string.h: Don't include bits/string.h. * string/bits/string3.h: Rename to bits/string_fortified.h. No need to undef various symbols that the removed headers might have defined as macros. * string/Makefile (headers): Remove bits/string.h, change bits/string3.h to bits/string_fortified.h. * string/string-inlines.c: Update commentary. Remove definitions of various macros that nothing looks at anymore. Don't directly include bits/string.h. Set _STRING_INLINE_unaligned here, based on compiler-predefined macros. * string/strncat.c: If STRNCAT is not defined, or STRNCAT_PRIMARY _is_ defined, provide internal hidden alias __strncat. * include/string.h: Declare internal hidden alias __strncat. Only forward __stpcpy to __builtin_stpcpy if __NO_STRING_INLINES is not defined. * include/bits/string3.h: Rename to bits/string_fortified.h, update to match above. * sysdeps/i386/string-inlines.c: Define compat symbols for everything formerly defined by sysdeps/x86/bits/string.h. Make existing definitions into compat symbols as well. Remove some no-longer-necessary messing around with macros. * sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c * sysdeps/powerpc/powerpc64/multiarch/mempcpy.c * sysdeps/powerpc/powerpc64/multiarch/stpcpy.c * sysdeps/s390/multiarch/mempcpy.c No need to define _HAVE_STRING_ARCH_mempcpy. Do define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT. * sysdeps/i386/i686/multiarch/strncat-c.c * sysdeps/s390/multiarch/strncat-c.c * sysdeps/x86_64/multiarch/strncat-c.c Define STRNCAT_PRIMARY. Don't change definition of libc_hidden_def.
* Fix typo when undefining weak_aliasSiddhesh Poyarekar2017-06-191-1/+1
| | | | | | The macro directive #undef was miswritten as #undefine. * sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Fix typo.
* x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in CH.J. Lu2017-06-1511-111/+211
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement strcspn/strpbrk/strspn IFUNC selectors in C All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcspn-sse2, strpbrk-sse2 and strspn-sse2. * sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed. Check USE_AS_STRPBRK instead of STRPBRK_P. * sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New. * sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file. * sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strcspn.c: Likewise. * sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strpbrk.c: Likewise. * sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strspn.c: Likewise. * sysdeps/x86_64/multiarch/strcspn.S: Removed. * sysdeps/x86_64/multiarch/strpbrk.S: Likewise. * sysdeps/x86_64/multiarch/strspn.S: Likewise. * sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED" and "#endif".
* x86-64: Implement wcscpy IFUNC selector in CH.J. Lu2017-06-151-17/+21
| | | | | * sysdeps/x86_64/multiarch/wcscpy.S: Removed. * sysdeps/x86_64/multiarch/wcscpy.c: New file.
* x86-64: Implement strcat family IFUNC selectors in CH.J. Lu2017-06-156-90/+95
| | | | | | | | | | | | | | | | | | | | Implement strcat family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for strcat family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcat-sse2. * sysdeps/x86_64/multiarch/strcat-sse2.S: New file. * sysdeps/x86_64/multiarch/strcat.c: Likewise. * sysdeps/x86_64/multiarch/strncat.c: Likewise. * sysdeps/x86_64/multiarch/strcat.S: Removed. * sysdeps/x86_64/multiarch/strncat.S: Likewise.
* x86-64: Implement memcmp family IFUNC selectors in CH.J. Lu2017-06-157-113/+126
| | | | | | | | | | | | | | | | | | | | | Implement memcmp family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memcmp family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcmp-sse2. * sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file. * sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memcmp.c: Likewise. * sysdeps/x86_64/multiarch/wmemcmp.c: Likewise. * sysdeps/x86_64/multiarch/memcmp.S: Removed. * sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
* x86-64: Implement memset family IFUNC selectors in CH.J. Lu2017-06-1512-147/+218
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement memset family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memset functions within libc. 2017-06-07 H.J. Lu <hongjiu.lu@intel.com> Erich Elsen <eriche@google.com> * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, and memset_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add test for __memset_chk_erms. Update comments. * sysdeps/x86_64/multiarch/ifunc-memset.h: New file. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.c: Likewise. * sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.c: Likewise. * sysdeps/x86_64/multiarch/memset.S: Removed. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset_chk_erms): New function.
* x86-64: Implement memmove family IFUNC selectors in CH.J. Lu2017-06-1420-474/+418
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement memmove family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for memmove family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memcpy_chk-nonshared, mempcpy_chk-nonshared and memmove_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __memmove_chk_erms, __memcpy_chk_erms and __mempcpy_chk_erms. Update comments. * sysdeps/x86_64/multiarch/ifunc-memmove.h: New file. * sysdeps/x86_64/multiarch/memcpy.c: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy.c: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Removed. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (__mempcpy_chk_erms): New function. (__memmove_chk_erms): Likewise. (__memcpy_chk_erms): New alias.
* x86-64: Implement strcpy family IFUNC selectors in CH.J. Lu2017-06-1214-131/+258
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement strcpy family IFUNC selectors in C. All internal calls within libc.so can use IFUNC on x86-64 since unlike x86, x86-64 supports PC-relative addressing to access the GOT entry so that it can call via PLT without using an extra register. For libc.a, we can't use IFUNC for functions which are called before IFUNC has been initialized. Use IFUNC internally reduces the icache footprint since libc.so and other codes in the process use the same implementations. This patch uses IFUNC for strcpy family functions within libc. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strcpy-sse2 and stpcpy-sse2. * sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file. * sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise. * sysdeps/x86_64/multiarch/stpcpy.c: Likewise. * sysdeps/x86_64/multiarch/stpncpy.c: Likewise. * sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strcpy.c: Likewise. * sysdeps/x86_64/multiarch/strncpy.c: Likewise. * sysdeps/x86_64/multiarch/stpcpy.S: Removed. * sysdeps/x86_64/multiarch/stpncpy.S: Likewise. * sysdeps/x86_64/multiarch/strcpy.S: Likewise. * sysdeps/x86_64/multiarch/strncpy.S: Likewise. * sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New. (libc_hidden_def): Always defined as empty. * sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def): Always Defined as empty.
* x86-64: Correct comments in ifunc-impl-list.cH.J. Lu2017-06-091-6/+6
| | | | * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Correct comments.
* x86-64: Optimize strrchr/wcsrchr with AVX2H.J. Lu2017-06-098-0/+368
| | | | | | | | | | | | | | | | | | | | Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector instructions. It is as fast as SSE2 version for small data sizes and up to 1X faster for large data sizes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strrchr_avx2, __strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2. * sysdeps/x86_64/multiarch/strrchr-avx2.S: New file. * sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strrchr.c: Likewise. * sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
* x86-64: Optimize memrchr with AVX2H.J. Lu2017-06-095-0/+424
| | | | | | | | | | | | | | | | | Optimize memrchr with AVX2 to search 32 bytes with a single vector compare instruction. It is as fast as SSE2 memrchr for small data sizes and up to 1X faster for large data sizes on Haswell. Select AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memrchr-sse2 and memrchr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and __memrchr_sse2. * sysdeps/x86_64/multiarch/memrchr-avx2.S: New file. * sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memrchr.c: Likewise.
* x86-64: Optimize strchr/strchrnul/wcschr with AVX2H.J. Lu2017-06-0912-58/+492
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector instructions. It is as fast as SSE2 versions for size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2, wcschr-sse2 and wcschr-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strchr_avx2, __strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and __wcschr_sse2. * sysdeps/x86_64/multiarch/strchr-avx2.S: New file. * sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strchr.c: Likewise. * sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise. * sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strchrnul.c: Likewise. * sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcschr.c: Likewise. * sysdeps/x86_64/multiarch/strchr.S: Removed.
* x86-64: Optimize strlen/strnlen/wcslen/wcsnlen with AVX2H.J. Lu2017-06-0913-1/+621
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with a single vector compare instruction. It is as fast as SSE2 versions for size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell. Select AVX2 version on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2, wcslen-sse2, wcslen-avx2 and wcsnlen-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add tests for __strlen_avx2, __strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2, __wcslen_sse2 and __wcsnlen_avx2. * sysdeps/x86_64/multiarch/strlen-avx2.S: New file. * sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strlen.c: Likewise. * sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/strnlen.c: Likewise. * sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wcslen.c: Likewise. * sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New. (IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.
* x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2H.J. Lu2017-06-0912-0/+580
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SSE2 memchr is extended to support wmemchr. AVX2 memchr/rawmemchr/wmemchr are added to search 32 bytes with a single vector compare instruction. AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr for small sizes and up to 1.5X faster for larger sizes on Haswell and Skylake. Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. * sysdeps/x86_64/memchr.S (MEMCHR): New. Depending on if USE_AS_WMEMCHR is defined. (PCMPEQ): Likewise. (memchr): Renamed to ... (MEMCHR): This. Support wmemchr if USE_AS_WMEMCHR is defined. Replace pcmpeqb with PCMPEQ. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2, wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c. * sysdeps/x86_64/multiarch/ifunc-avx2.h: New file. * sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/memchr.c: Likewise. * sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/rawmemchr.c: Likewise. * sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise. * sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise. * sysdeps/x86_64/multiarch/wmemchr.c: Likewise. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2, __rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and __wmemchr_sse2.
* x86-64: Rename wmemset.h to ifunc-wmemset.hH.J. Lu2017-06-074-4/+4
| | | | | | | | | | No code changes. * sysdeps/x86_64/multiarch/wmemset.c: Include ifunc-wmemset.h instead of wmemset.h. * sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise. * sysdeps/x86_64/multiarch/wmemset.h: Renamed to ... * sysdeps/x86_64/multiarch/ifunc-wmemset.h: This.
* x86-64: Fold ifunc-sse4_1.h into wcsnlen.cH.J. Lu2017-06-072-35/+15
| | | | | | | | | | | | Since ifunc-sse4_1.h is included only by wcsnlen.c, we can fold it into wcsnlen.c. No code changes in wcsnlen.o. 2017-06-07 H.J. Lu <hongjiu.lu@intel.com> * sysdeps/x86_64/multiarch/ifunc-sse4_1.h: Removed and folded into ... * sysdeps/x86_64/multiarch/wcsnlen.c: Here. Don't include ifunc-sse4_1.h.
* x86-64: Move wcsnlen.S to multiarch/wcsnlen-sse4_1.SH.J. Lu2017-06-066-1/+88
| | | | | | | | | | | | | | | | Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S to multiarch/wcsnlen-sse4_1.S. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add wcsnlen-sse4_1 and wcsnlen-c. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and __wcsnlen_sse2. * sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file. * sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise. * sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise. * sysdeps/x86_64/multiarch/wcsnlen.c: Likewise. * sysdeps/x86_64/wcsnlen.S: Removed.
* x86-64: Optimize memcmp/wmemcmp with AVX2 and MOVBEH.J. Lu2017-06-056-3/+465
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize x86-64 memcmp/wmemcmp with AVX2. It uses vector compare as much as possible. It is as fast as SSE4 memcmp for size <= 16 bytes and up to 2X faster for size > 16 bytes on Haswell and Skylake. Select AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast. NB: It uses TZCNT instead of BSF since TZCNT produces the same result as BSF for non-zero input. TZCNT is faster than BSF and is executed as BSF if machine doesn't support TZCNT. Key features: 1. For size from 2 to 7 bytes, load as big endian with movbe and bswap to avoid branches. 2. Use overlapping compare to avoid branch. 3. Use vector compare when size >= 4 bytes for memcmp or size >= 8 bytes for wmemcmp. 4. If size is 8 * VEC_SIZE or less, unroll the loop. 5. Compare 4 * VEC_SIZE at a time with the aligned first memory area. 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. * sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memcmp-avx2 and wmemcmp-avx2. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2. * sysdeps/x86_64/multiarch/memcmp-avx2.S: New file. * sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise. * sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred. * sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX 2 machines if AVX unaligned load is fast and vzeroupper is preferred.
* x86-64: Optimize wmemset with SSE2/AVX2/AVX512H.J. Lu2017-06-0510-8/+199
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The difference between memset and wmemset is byte vs int. Add stubs to SSE2/AVX2/AVX512 memset for wmemset with updated constant and size: SSE2 wmemset: shl $0x2,%rdx movd %esi,%xmm0 mov %rdi,%rax pshufd $0x0,%xmm0,%xmm0 jmp entry_from_wmemset SSE2 memset: movd %esi,%xmm0 mov %rdi,%rax punpcklbw %xmm0,%xmm0 punpcklwd %xmm0,%xmm0 pshufd $0x0,%xmm0,%xmm0 entry_from_wmemset: Since the ERMS versions of wmemset requires "rep stosl" instead of "rep stosb", only the vector store stubs of SSE2/AVX2/AVX512 wmemset are added. The SSE2 wmemset is about 3X faster and the AVX2 wmemset is about 6X faster on Haswell. * include/wchar.h (__wmemset_chk): New. * sysdeps/x86_64/memset.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_CHK_SYMBOL): Likewise. (WMEMSET_SYMBOL): Likewise. (__wmemset): Add hidden definition. (wmemset): Add weak hidden definition. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add wmemset_chk-nonshared. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Add __wmemset_sse2_unaligned, __wmemset_avx2_unaligned, __wmemset_avx512_unaligned, __wmemset_chk_sse2_unaligned, __wmemset_chk_avx2_unaligned and __wmemset_chk_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ... (MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_SYMBOL): Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ... (MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This. (WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New. (WMEMSET_SYMBOL): Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Updated. (WMEMSET_CHK_SYMBOL): New. (WMEMSET_CHK_SYMBOL (__wmemset_chk, unaligned)): Likewise. (WMEMSET_SYMBOL (__wmemset, unaligned)): Likewise. * sysdeps/x86_64/multiarch/memset.S (WMEMSET_SYMBOL): New. (libc_hidden_builtin_def): Also define __GI_wmemset and __GI___wmemset. (weak_alias): New. * sysdeps/x86_64/multiarch/wmemset.c: New file. * sysdeps/x86_64/multiarch/wmemset.h: Likewise. * sysdeps/x86_64/multiarch/wmemset_chk-nonshared.S: Likewise. * sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise. * sysdeps/x86_64/wmemset.c: Likewise. * sysdeps/x86_64/wmemset_chk.c: Likewise.
* Correct comments in x86_64/multiarch/memcmp.SH.J. Lu2017-05-181-3/+3
| | | | | * sysdeps/x86_64/multiarch/memcmp.S (__GI_memcmp): Correct comments.
* Suppress internal declarations for most of the testsuite.Zack Weinberg2017-05-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new build module called 'testsuite'. IS_IN (testsuite) implies _ISOMAC, as do IS_IN_build and __cplusplus (which means several ad-hoc tests for __cplusplus can go away). libc-symbols.h now suppresses almost all of *itself* when _ISOMAC is defined; in particular, _ISOMAC mode does not get config.h automatically anymore. There are still quite a few tests that need to see internal gunk of one variety or another. For them, we now have 'tests-internal' and 'test-internal-extras'; files in this category will still be compiled with MODULE_NAME=nonlib, and everything proceeds as it always has. The bulk of this patch is moving tests from 'tests' to 'tests-internal'. There is also 'tests-static-internal', which has the same effect on files in 'tests-static', and 'modules-names-tests', which has the *inverse* effect on files in 'modules-names' (it's inverted because most of the things in modules-names are *not* tests). For both of these, the file must appear in *both* the new variable and the old one. There is also now a special case for when libc-symbols.h is included without MODULE_NAME being defined at all. (This happens during the creation of libc-modules.h, and also when preprocessing Versions files.) When this happens, IS_IN is set to be always false and _ISOMAC is *not* defined, which was the status quo, but now it's explicit. The remaining changes to C source files in this patch seemed likely to cause problems in the absence of the main change. They should be relatively self-explanatory. In a few cases I duplicated a definition from an internal header rather than move the test to tests-internal; this was a judgement call each time and I'm happy to change those however reviewers feel is more appropriate. * Makerules: New subdir configuration variables 'tests-internal' and 'test-internal-extras'. Test files in these categories will still be compiled with MODULE_NAME=nonlib. Test files in the existing categories (tests, xtests, test-srcs, test-extras) are now compiled with MODULE_NAME=testsuite. New subdir configuration variable 'modules-names-tests'. Files which are in both 'modules-names' and 'modules-names-tests' will be compiled with MODULE_NAME=testsuite instead of MODULE_NAME=extramodules. (gen-as-const-headers): Move to tests-internal. (do-tests-clean, common-mostlyclean): Support tests-internal. * Makeconfig (built-modules): Add testsuite. * Makefile: Change libof-check-installed-headers-c and libof-check-installed-headers-cxx to 'testsuite'. * Rules: Likewise. Support tests-internal. * benchtests/strcoll-inputs/filelist#en_US.UTF-8: Remove extra-modules.mk. * config.h.in: Don't check for __OPTIMIZE__ or __FAST_MATH__ here. * include/libc-symbols.h: Move definitions of _GNU_SOURCE, PASTE_NAME, PASTE_NAME1, IN_MODULE, IS_IN, and IS_IN_LIB to the very top of the file and rationalize their order. If MODULE_NAME is not defined at all, define IS_IN to always be false, and don't define _ISOMAC. If any of IS_IN (testsuite), IS_IN_build, or __cplusplus are true, define _ISOMAC and suppress everything else in this file, starting with the inclusion of config.h. Do check for inappropriate definitions of __OPTIMIZE__ and __FAST_MATH__ here, but only if _ISOMAC is not defined. Correct some out-of-date commentary. * include/math.h: If _ISOMAC is defined, undefine NO_LONG_DOUBLE and _Mlong_double_ before including math.h. * include/string.h: If _ISOMAC is defined, don't expose _STRING_ARCH_unaligned. Move a comment to a more appropriate location. * include/errno.h, include/stdio.h, include/stdlib.h, include/string.h * include/time.h, include/unistd.h, include/wchar.h: No need to check __cplusplus nor use __BEGIN_DECLS/__END_DECLS. * misc/sys/cdefs.h (__NTHNL): New macro. * sysdeps/m68k/m680x0/fpu/bits/mathinline.h (__m81_defun): Use __NTHNL to avoid errors with GCC 6. * elf/tst-env-setuid-tunables.c: Include config.h with _LIBC defined, for HAVE_TUNABLES. * inet/tst-checks-posix.c: No need to define _ISOMAC. * intl/tst-gettext2.c: Provide own definition of N_. * math/test-signgam-finite-c99.c: No need to define _ISOMAC. * math/test-signgam-main.c: No need to define _ISOMAC. * stdlib/tst-strtod.c: Convert to test-driver. Split locale_test to... * stdlib/tst-strtod1i.c: ...this new file. * stdlib/tst-strtod5.c: Convert to test-driver and add copyright notice. Split tests of __strtod_internal to... * stdlib/tst-strtod5i.c: ...this new file. * string/test-string.h: Include stdint.h. Duplicate definition of inhibit_loop_to_libcall here (from libc-symbols.h). * string/test-strstr.c: Provide dummy definition of libc_hidden_builtin_def when including strstr.c. * sysdeps/ia64/fpu/libm-symbols.h: Suppress entire file in _ISOMAC mode; no need to test __STRICT_ANSI__ nor __cplusplus as well. * sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h. Don't include init-arch.h. * sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h. Don't include init-arch.h. * elf/Makefile: Move tst-ptrguard1-static, tst-stackguard1-static, tst-tls1-static, tst-tls2-static, tst-tls3-static, loadtest, unload, unload2, circleload1, neededtest, neededtest2, neededtest3, neededtest4, tst-tls1, tst-tls2, tst-tls3, tst-tls6, tst-tls7, tst-tls8, tst-dlmopen2, tst-ptrguard1, tst-stackguard1, tst-_dl_addr_inside_object, and all of the ifunc tests to tests-internal. Don't add $(modules-names) to test-extras. * inet/Makefile: Move tst-inet6_scopeid_pton to tests-internal. Add tst-deadline to tests-static-internal. * malloc/Makefile: Move tst-mallocstate and tst-scratch_buffer to tests-internal. * misc/Makefile: Move tst-atomic and tst-atomic-long to tests-internal. * nptl/Makefile: Move tst-typesizes, tst-rwlock19, tst-sem11, tst-sem12, tst-sem13, tst-barrier5, tst-signal7, tst-tls3, tst-tls3-malloc, tst-tls5, tst-stackguard1, tst-sem11-static, tst-sem12-static, and tst-stackguard1-static to tests-internal. Link tests-internal with libpthread also. Don't add $(modules-names) to test-extras. * nss/Makefile: Move tst-field to tests-internal. * posix/Makefile: Move bug-regex5, bug-regex20, bug-regex33, tst-rfc3484, tst-rfc3484-2, and tst-rfc3484-3 to tests-internal. * stdlib/Makefile: Move tst-strtod1i, tst-strtod3, tst-strtod4, tst-strtod5i, tst-tls-atexit, and tst-tls-atexit-nodelete to tests-internal. * sunrpc/Makefile: Move tst-svc_register to tests-internal. * sysdeps/powerpc/Makefile: Move test-get_hwcap and test-get_hwcap-static to tests-internal. * sysdeps/unix/sysv/linux/Makefile: Move tst-setgetname to tests-internal. * sysdeps/x86_64/fpu/Makefile: Add all libmvec test modules to modules-names-tests.
* x86: Use AVX2 memcpy/memset on Skylake server [BZ #21396]H.J. Lu2017-04-188-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On Skylake server, AVX512 load/store instructions in memcpy/memset may lead to lower CPU turbo frequency in certain situations. Use of AVX2 in memcpy/memset has been observed to have improved overall performance in many workloads due to the higher frequency. Since AVX512ER is unique to Xeon Phi, this patch sets Prefer_No_AVX512 if AVX512ER isn't available so that AVX2 versions of memcpy/memset are used on Skylake server. [BZ #21396] * sysdeps/x86/cpu-features.c (init_cpu_features): Set Prefer_No_AVX512 if AVX512ER isn't available. * sysdeps/x86/cpu-features.h (bit_arch_Prefer_No_AVX512): New. (index_arch_Prefer_No_AVX512): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Don't use AVX512 version if Prefer_No_AVX512 is set. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/x86_64/multiarch/memset.S (memset): Likewise. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Likewise.
* Revert header inclusion changes that break math/ testing on x86_64.Joseph Myers2017-02-171-1/+1
| | | | | | | | | | Revert: 2017-02-16 Zack Weinberg <zackw@panix.com> * sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h. Don't include init-arch.h. * sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h. Don't include init-arch.h.
* Add missing header files throughout the testsuite.Zack Weinberg2017-02-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * crypt/md5.h: Test _LIBC with #if defined, not #if. * dirent/opendir-tst1.c: Include sys/stat.h. * dirent/tst-fdopendir.c: Include sys/stat.h. * dirent/tst-fdopendir2.c: Include stdlib.h. * dirent/tst-scandir.c: Include stdbool.h. * elf/tst-auditmod1.c: Include link.h and stddef.h. * elf/tst-tls15.c: Include stdlib.h. * elf/tst-tls16.c: Include stdlib.h. * elf/tst-tls17.c: Include stdlib.h. * elf/tst-tls18.c: Include stdlib.h. * iconv/tst-iconv6.c: Include endian.h. * iconvdata/bug-iconv11.c: Include limits.h. * io/test-utime.c: Include stdint.h. * io/tst-faccessat.c: Include sys/stat.h. * io/tst-fchmodat.c: Include sys/stat.h. * io/tst-fchownat.c: Include sys/stat.h. * io/tst-fstatat.c: Include sys/stat.h. * io/tst-futimesat.c: Include sys/stat.h. * io/tst-linkat.c: Include sys/stat.h. * io/tst-mkdirat.c: Include sys/stat.h and stdbool.h. * io/tst-mkfifoat.c: Include sys/stat.h and stdbool.h. * io/tst-mknodat.c: Include sys/stat.h and stdbool.h. * io/tst-openat.c: Include stdbool.h. * io/tst-readlinkat.c: Include sys/stat.h. * io/tst-renameat.c: Include sys/stat.h. * io/tst-symlinkat.c: Include sys/stat.h. * io/tst-unlinkat.c: Include stdbool.h. * libio/bug-memstream1.c: Include stdlib.h. * libio/bug-wmemstream1.c: Include stdlib.h. * libio/tst-fwrite-error.c: Include stdlib.h. * libio/tst-memstream1.c: Include stdlib.h. * libio/tst-memstream2.c: Include stdlib.h. * libio/tst-memstream3.c: Include stdlib.h. * malloc/tst-interpose-aux.c: Include stdint.h. * misc/tst-preadvwritev-common.c: Include sys/stat.h. * nptl/tst-basic7.c: Include limits.h. * nptl/tst-cancel25.c: Include pthread.h, not pthreadP.h. * nptl/tst-cancel4.c: Include stddef.h, limits.h, and sys/stat.h. * nptl/tst-cancel4_1.c: Include stddef.h. * nptl/tst-cancel4_2.c: Include stddef.h. * nptl/tst-cond16.c: Include limits.h. Use sysconf(_SC_PAGESIZE) instead of __getpagesize. * nptl/tst-cond18.c: Include limits.h. Use sysconf(_SC_PAGESIZE) instead of __getpagesize. * nptl/tst-cond4.c: Include stdint.h. * nptl/tst-cond6.c: Include stdint.h. * nptl/tst-stack2.c: Include limits.h. * nptl/tst-stackguard1.c: Include stddef.h. * nptl/tst-tls4.c: Include stdint.h. Don't include tls.h. * nptl/tst-tls4moda.c: Include stddef.h. Don't include stdio.h, unistd.h, or tls.h. * nptl/tst-tls4modb.c: Include stddef.h. Don't include stdio.h, unistd.h, or tls.h. * nptl/tst-tls5.h: Include stddef.h. Don't include stdlib.h or tls.h. * posix/tst-getaddrinfo2.c: Include stdio.h. * posix/tst-getaddrinfo5.c: Include stdio.h. * posix/tst-pathconf.c: Include sys/stat.h. * posix/tst-posix_fadvise-common.c: Include stdint.h. * posix/tst-preadwrite-common.c: Include sys/stat.h. * posix/tst-regex.c: Include stdint.h. Don't include spawn.h or spawn_int.h. * posix/tst-regexloc.c: Don't include spawn.h or spawn_int.h. * posix/tst-vfork3.c: Include sys/stat.h. * resolv/tst-bug18665-tcp.c: Include stdlib.h. * resolv/tst-res_hconf_reorder.c: Include stdlib.h. * resolv/tst-resolv-search.c: Include stdlib.h. * stdio-common/tst-fmemopen2.c: Include stdint.h. * stdio-common/tst-vfprintf-width-prec.c: Include stdlib.h. * stdlib/test-canon.c: Include sys/stat.h. * stdlib/tst-tls-atexit.c: Include stdbool.h. * string/test-memchr.c: Include stdint.h. * string/tst-cmp.c: Include stdint.h. * sysdeps/pthread/tst-timer.c: Include stdint.h. * sysdeps/unix/sysv/linux/tst-sync_file_range.c: Include stdint.h. * sysdeps/wordsize-64/tst-writev.c: Include limits.h and stdint.h. * sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h. Don't include init-arch.h. * sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h. Don't include init-arch.h. * sysdeps/x86_64/tst-auditmod10b.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod3b.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod4b.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod5b.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod6b.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod6c.c: Include link.h and stddef.h. * sysdeps/x86_64/tst-auditmod7b.c: Include link.h and stddef.h. * time/clocktest.c: Include stdint.h. * time/tst-posixtz.c: Include stdint.h. * timezone/tst-timezone.c: Include stdint.h.
* Add VZEROUPPER to memset-vec-unaligned-erms.S [BZ #21081]H.J. Lu2017-01-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Since memset-vec-unaligned-erms.S has VDUP_TO_VEC0_AND_SET_RETURN at function entry, memset optimized for AVX2 and AVX512 will always use ymm/zmm register. VZEROUPPER should be placed before ret in L(stosb): movq %rdx, %rcx movzbl %sil, %eax movq %rdi, %rdx rep stosb movq %rdx, %rax ret since it can be reached from L(stosb_more_2x_vec): cmpq $REP_STOSB_THRESHOLD, %rdx ja L(stosb) [BZ #21081] * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (L(stosb)): Add VZEROUPPER before ret.
* Fix x86 strncat optimized implementation for large sizesAdhemerval Zanella2017-01-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | Similar to BZ#19387, BZ#21014, and BZ#20971, both x86 sse2 strncat optimized assembly implementations do not handle the size overflow correctly. The x86_64 one is in fact an issue with strcpy-sse2-unaligned, but that is triggered also with strncat optimized implementation. This patch uses a similar strategy used on 3daef2c8ee4df2, where saturared math is used for overflow case. Checked on x86_64-linux-gnu and i686-linux-gnu. It fixes BZ #19390. [BZ #19390] * string/test-strncat.c (test_main): Add tests with SIZE_MAX as maximum string size. * sysdeps/i386/i686/multiarch/strcat-sse2.S (STRCAT): Avoid overflow in pointer addition. * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S (STRCPY): Likewise.
* Update copyright dates with scripts/update-copyrights.Joseph Myers2017-01-0142-42/+42
|
* Require binutils 2.24 to build x86-64 glibc [BZ #20139]H.J. Lu2016-07-0113-36/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used to save the first 8 vector registers, which only saves the lower 256 bits of vector register, for lazy binding. When it is called on AVX512 platform, the upper 256 bits of ZMM registers are clobbered. Parameters passed in ZMM registers will be wrong when the function is called the first time. This patch requires binutils 2.24, whose assembler can store and load ZMM registers, to build x86-64 glibc. Since mathvec library needs assembler support for AVX512DQ, we disable mathvec if assembler doesn't support AVX512DQ. [BZ #20139] * config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ... (HAVE_AVX512DQ_ASM_SUPPORT): This. * sysdeps/x86_64/configure.ac: Require assembler from binutils 2.24 or above. (HAVE_AVX512_ASM_SUPPORT): Removed. (HAVE_AVX512DQ_ASM_SUPPORT): New. * sysdeps/x86_64/configure: Regenerated. * sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT check unconditional. * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise. * sysdeps/x86_64/multiarch/memcpy.S: Likewise. * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset.S: Likewise. * sysdeps/x86_64/multiarch/memset_chk.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT. * sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51: Likewise. * sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S: Likewise.
* Check Prefer_ERMS in memmove/memcpy/mempcpy/memsetH.J. Lu2016-06-305-1/+17
| | | | | | | | | | | | | | | | | | | | Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove, memcpy, mempcpy and memset aren't used by the current processors, this patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so that they can be used in the future. * sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New. (index_arch_Prefer_ERMS): Likewise. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return __memcpy_erms for Prefer_ERMS. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (__memmove_erms): Enabled for libc.a. * ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return __memmove_erms or Prefer_ERMS. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return __mempcpy_erms for Prefer_ERMS. * sysdeps/x86_64/multiarch/memset.S (memset): Return __memset_erms for Prefer_ERMS.
* X86-64: Remove previous default/SSE2/AVX2 memcpy/memmoveH.J. Lu2016-06-0815-890/+306
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __memcpy_chk_sse2_unaligned. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S Change function suffix from unaligned_2 to unaligned. * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias. * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.
* X86-64: Remove the previous SSE2/AVX2 memsetsH.J. Lu2016-06-087-217/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
* Fix a typo in comments in memmove-vec-unaligned-erms.SH.J. Lu2016-06-061-1/+1
| | | | | * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Fix a typo in comments.
* Remove alignments on jump targets in memsetH.J. Lu2016-05-191-32/+5
| | | | | | | | | | | | | | | X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which increases code sizes, but not necessarily improve performance. As memset benchtest data of align vs no align on various Intel and AMD processors https://sourceware.org/bugzilla/attachment.cgi?id=9277 shows that aligning jump targets isn't necessary. [BZ #20115] * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset): Remove alignments on jump targets.
* Remove x86 ifunc-defines.sym and rtld-global-offsets.symH.J. Lu2016-05-112-21/+0
| | | | | | | | | | | | | | | | | | | | Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on i686 and x86-64. * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers): Remove ifunc-defines.sym. * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers): Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed. * sysdeps/x86/rtld-global-offsets.sym: Likewise. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise. * sysdeps/x86/Makefile (gen-as-const-headers): Remove rtld-global-offsets.sym. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ... * sysdeps/x86/cpu-features-offsets.sym: This. * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h> instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
* X86-64: Use non-temporal store in memcpy on large dataH.J. Lu2016-04-124-171/+226
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machine. non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source.
* X86-64: Prepare memmove-vec-unaligned-erms.SH.J. Lu2016-04-061-54/+84
| | | | | | | | | | | | | | Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the default memcpy, mempcpy and memmove. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.
* X86-64: Prepare memset-vec-unaligned-erms.SH.J. Lu2016-04-061-13/+19
| | | | | | | | | | | | Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the default memset. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled fro now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.
* Force 32-bit displacement in memset-vec-unaligned-erms.SH.J. Lu2016-04-051-0/+13
| | | | | * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions.
* Add a comment in memset-sse2-unaligned-erms.SH.J. Lu2016-04-051-0/+2
| | | | | * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA.
* Don't put SSE2/AVX/AVX512 memmove/memset in ld.soH.J. Lu2016-04-036-32/+40
| | | | | | | | | | | | | | Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise.
* Fix memmove-vec-unaligned-erms.SH.J. Lu2016-04-031-24/+30
| | | | | | | | | | | | | | __mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove it breaks __memmove_chk. Don't check source == destination first since it is less common. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.
* Add x86-64 memset with unaligned store and rep stosbH.J. Lu2016-03-316-1/+335
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 memset with unaligned store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
* Add x86-64 memmove with unaligned load/store and rep movsbH.J. Lu2016-03-316-1/+594
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap bewteen source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove. A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same codes when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD which is 2KB for 16-byte vector register size and scaled up by large vector register size. Key features: 1. Use overlapping load and store to avoid branch. 2. For size <= 8 times of vector register size, load all sources into registers and store them together. 3. If there is no address overlap bewteen source and destination, copy from both ends with 4 times of vector register size at a time. 4. If address of destination > address of source, backward copy 8 times of vector register size at a time. 5. Otherwise, forward copy 8 times of vector register size at a time. 6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by fallbacking to backward copy 8 times of vector register size at a time. 7. Skip when address of destination == address of source. [BZ #19776] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likwise. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likwise.