Commit message | Author | Age | Files | Lines
* Count number of logical processors sharing L2 cache hjl/erms/2.23H.J. Lu2016-06-061-34/+116
For Intel processors, when there are both L2 and L3 caches, the SMT level type should be used to count the number of available logical processors sharing the L2 cache. If there is only an L2 cache, the core level type should be used instead. The number of available logical processors sharing the L2 cache should be used for non-inclusive L2 and L3 caches.

* sysdeps/x86/cacheinfo.c (init_cacheinfo): Count number of available logical processors with SMT level type sharing L2 cache for Intel processors.
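A rough standalone sketch of the CPUID leaf 0xB lookup described above; this is not the init_cacheinfo code, the threads_sharing_l2 helper and its has_l3 flag are made-up names, and GCC's <cpuid.h> wrappers are assumed to be available.

    /* Sketch: count logical processors sharing the L2 cache on an Intel
       CPU.  With an L3 cache present, take the thread count of the SMT
       level from CPUID leaf 0xb; with only an L2 cache, take the core
       level instead.  */
    #include <cpuid.h>
    #include <stdio.h>

    static unsigned int
    threads_sharing_l2 (int has_l3)
    {
      unsigned int wanted = has_l3 ? 1 : 2;   /* Level type: 1 = SMT, 2 = core.  */
      unsigned int eax, ebx, ecx, edx;

      for (unsigned int subleaf = 0; ; subleaf++)
        {
          if (!__get_cpuid_count (0xb, subleaf, &eax, &ebx, &ecx, &edx))
            return 1;                          /* Leaf 0xb not supported.  */
          unsigned int level_type = (ecx >> 8) & 0xff;
          if (level_type == 0)
            return 1;                          /* Ran out of levels.  */
          if (level_type == wanted)
            return ebx & 0xffff;               /* Logical processors at this level.  */
        }
    }

    int
    main (void)
    {
      printf ("logical processors sharing L2: %u\n", threads_sharing_l2 (1));
      return 0;
    }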
* Remove special L2 cache case for Knights LandingH.J. Lu2016-06-061-2/+0
| | | | | | | | | | | | | | L2 cache is shared by 2 cores on Knights Landing, which has 4 threads per core: https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing So L2 cache is shared by 8 threads on Knights Landing as reported by CPUID. We should remove special L2 cache case for Knights Landing. [BZ #18185] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Don't limit threads sharing L2 cache to 2 for Knights Landing.
* Correct Intel processor level type mask from CPUIDH.J. Lu2016-06-061-1/+1
| | | | | | | | | | | | | | | Intel CPUID with EAX == 11 returns: ECX Bits 07 - 00: Level number. Same value in ECX input. Bits 15 - 08: Level type. ^^^^^^^^^^^^^^^^^^^^^^^^ This is level type. Bits 31 - 16: Reserved. Intel processor level type mask should be 0xff00, not 0xff0. [BZ #20119] * sysdeps/x86/cacheinfo.c (init_cacheinfo): Correct Intel processor level type mask for CPUID with EAX == 11.
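A tiny compilable illustration of the mask fix described above; the example ECX value is made up, and only the bit layout follows the CPUID description quoted in the commit.

    #include <stdio.h>

    /* The level type is bits 15:8 of ECX from CPUID with EAX == 11, so the
       mask is 0xff00 shifted down by 8; 0xff0 would mix in the level number.  */
    static unsigned int
    level_type (unsigned int ecx)
    {
      return (ecx & 0xff00) >> 8;
    }

    int
    main (void)
    {
      unsigned int ecx = 0x0201;                        /* Level type 2, level number 1.  */
      printf ("level type = %u\n", level_type (ecx));   /* Prints 2.  */
      return 0;
    }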
* Check the HTT bit before counting logical threadsH.J. Lu2016-06-062-76/+85
| | | | | | | | | | | Skip counting logical threads for Intel processors if the HTT bit is 0 which indicates there is only a single logical processor. * sysdeps/x86/cacheinfo.c (init_cacheinfo): Skip counting logical threads if the HTT bit is 0. * sysdeps/x86/cpu-features.h (bit_cpu_HTT): New. (index_cpu_HTT): Likewise. (reg_HTT): Likewise.
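For reference, a hedged sketch of the HTT test the commit describes, using GCC's <cpuid.h> rather than glibc's cpu-features machinery; HTT is EDX bit 28 of CPUID leaf 1.

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (__get_cpuid (1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 28)))
        puts ("HTT set: count logical threads");
      else
        puts ("HTT clear: single logical processor, skip the counting");
      return 0;
    }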
* Remove alignments on jump targets in memsetH.J. Lu2016-06-061-32/+5
X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which increases code size but does not necessarily improve performance. The memset benchtest data comparing aligned and unaligned jump targets on various Intel and AMD processors (https://sourceware.org/bugzilla/attachment.cgi?id=9277) shows that aligning jump targets isn't necessary.

[BZ #20115]
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset): Remove alignments on jump targets.
* Call init_cpu_features only if SHARED is definedH.J. Lu2016-06-062-0/+8
| | | | | | | | | | In static executable, since init_cpu_features is called early from __libc_start_main, there is no need to call it again in dl_platform_init. [BZ #20072] * sysdeps/i386/dl-machine.h (dl_platform_init): Call init_cpu_features only if SHARED is defined. * sysdeps/x86_64/dl-machine.h (dl_platform_init): Likewise.
* Support non-inclusive caches on Intel processorsH.J. Lu2016-06-061-1/+11
| | | | | * sysdeps/x86/cacheinfo.c (init_cacheinfo): Check and support non-inclusive caches on Intel processors.
* Remove x86 ifunc-defines.sym and rtld-global-offsets.symH.J. Lu2016-06-068-51/+18
| | | | | | | | | | | | | | | | | | | | Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym. Remove x86 ifunc-defines.sym and rtld-global-offsets.sym. No code changes on i686 and x86-64. * sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers): Remove ifunc-defines.sym. * sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers): Likewise. * sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed. * sysdeps/x86/rtld-global-offsets.sym: Likewise. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise. * sysdeps/x86/Makefile (gen-as-const-headers): Remove rtld-global-offsets.sym. * sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ... * sysdeps/x86/cpu-features-offsets.sym: This. * sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h> instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
* Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86H.J. Lu2016-06-062-1/+1
| | | | | | | | | | Move sysdeps/x86_64/cacheinfo.c to sysdeps/x86. No code changes on x86 and x86_64. * sysdeps/i386/cacheinfo.c: Include <sysdeps/x86/cacheinfo.c> instead of <sysdeps/x86_64/cacheinfo.c>. * sysdeps/x86_64/cacheinfo.c: Moved to ... * sysdeps/x86/cacheinfo.c: Here.
* Detect Intel Goldmont and Airmont processorsH.J. Lu2016-06-061-0/+8
| | | | | | | | | Updated from the model numbers of Goldmont and Airmont processors in Intel64 And IA-32 Processor Architectures Software Developer's Manual Volume 3 Revision 058. * sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel Goldmont and Airmont processors.
* X86-64: Add dummy memcopy.h and wordcopy.cH.J. Lu2016-04-082-0/+2
| | | | | | | | | Since x86-64 doesn't use memory copy functions, add dummy memcopy.h and wordcopy.c to reduce code size. It reduces the size of libc.so by about 1 KB. * sysdeps/x86_64/memcopy.h: New file. * sysdeps/x86_64/wordcopy.c: Likewise.
* X86-64: Remove previous default/SSE2/AVX2 memcpy/memmoveH.J. Lu2016-04-0819-1490/+394
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones, we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with the new ones. No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used before. If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2 memcpy/memmove optimized with Enhanced REP MOVSB will be used for processors with ERMS. The new AVX512 memcpy/memmove will be used for processors with AVX512 which prefer vzeroupper. Since the new SSE2 memcpy/memmove are faster than the previous default memcpy/memmove used in libc.a and ld.so, we also remove the previous default memcpy/memmove and make them the default memcpy/memmove, except that non-temporal store isn't used in ld.so. Together, it reduces the size of libc.so by about 6 KB and the size of ld.so by about 2 KB. [BZ #19776] * sysdeps/x86_64/memcpy.S: Make it dummy. * sysdeps/x86_64/mempcpy.S: Likewise. * sysdeps/x86_64/memmove.S: New file. * sysdeps/x86_64/memmove_chk.S: Likewise. * sysdeps/x86_64/multiarch/memmove.S: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.S: Likewise. * sysdeps/x86_64/memmove.c: Removed. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memmove.c: Likewise. * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-sse2-unaligned, memmove-avx-unaligned, memcpy-avx-unaligned and memmove-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Replace __memmove_chk_avx512_unaligned_2 with __memmove_chk_avx512_unaligned. Remove __memmove_chk_avx_unaligned_2. Replace __memmove_chk_sse2_unaligned_2 with __memmove_chk_sse2_unaligned. Remove __memmove_chk_sse2 and __memmove_avx_unaligned_2. Replace __memmove_avx512_unaligned_2 with __memmove_avx512_unaligned. Replace __memmove_sse2_unaligned_2 with __memmove_sse2_unaligned. Remove __memmove_sse2. Replace __memcpy_chk_avx512_unaligned_2 with __memcpy_chk_avx512_unaligned. Remove __memcpy_chk_avx_unaligned_2. Replace __memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned. Remove __memcpy_chk_sse2. Remove __memcpy_avx_unaligned_2. Replace __memcpy_avx512_unaligned_2 with __memcpy_avx512_unaligned. Remove __memcpy_sse2_unaligned_2 and __memcpy_sse2. Replace __mempcpy_chk_avx512_unaligned_2 with __mempcpy_chk_avx512_unaligned. Remove __mempcpy_chk_avx_unaligned_2. Replace __mempcpy_chk_sse2_unaligned_2 with __mempcpy_chk_sse2_unaligned. Remove __mempcpy_chk_sse2. Replace __mempcpy_avx512_unaligned_2 with __mempcpy_avx512_unaligned. Remove __mempcpy_avx_unaligned_2. Replace __mempcpy_sse2_unaligned_2 with __mempcpy_sse2_unaligned. Remove __mempcpy_sse2. * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support __memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned. Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../memcpy.S. 
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support __memcpy_chk_avx512_unaligned_erms and __memcpy_chk_avx512_unaligned. Use __memcpy_chk_avx_unaligned_erms and __memcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __memcpy_chk_sse2_unaligned.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Change function suffix from unaligned_2 to unaligned.
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support __mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned. Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_sse2_unaligned. (ENTRY): Removed. (END): Likewise. (ENTRY_CHK): Likewise. (libc_hidden_builtin_def): Likewise. Don't include ../mempcpy.S. (mempcpy): New. Add a weak alias.
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support __mempcpy_chk_avx512_unaligned_erms and __mempcpy_chk_avx512_unaligned. Use __mempcpy_chk_avx_unaligned_erms and __mempcpy_chk_sse2_unaligned_erms if processor has ERMS. Default to __mempcpy_chk_sse2_unaligned.
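A simplified C model of the IFUNC policy the description above lays out; the real selector is assembly in sysdeps/x86_64/multiarch/memcpy.S and also considers the SSSE3 variants, so the struct fields and helper below are illustrative stand-ins for the cpu-features bits, not the actual interface.

    #include <stdio.h>

    struct cpu
    {
      int avx512_usable;            /* AVX512 available and vzeroupper preferred.  */
      int avx_fast_unaligned_load;
      int erms;                     /* Enhanced REP MOVSB/STOSB.  */
    };

    static const char *
    select_memcpy (const struct cpu *c)
    {
      if (c->avx512_usable)
        return c->erms ? "__memcpy_avx512_unaligned_erms" : "__memcpy_avx512_unaligned";
      if (c->avx_fast_unaligned_load)
        return c->erms ? "__memcpy_avx_unaligned_erms" : "__memcpy_avx_unaligned";
      return c->erms ? "__memcpy_sse2_unaligned_erms" : "__memcpy_sse2_unaligned";
    }

    int
    main (void)
    {
      struct cpu haswell = { 0, 1, 1 };
      puts (select_memcpy (&haswell));   /* __memcpy_avx_unaligned_erms */
      return 0;
    }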
* X86-64: Remove the previous SSE2/AVX2 memsetsH.J. Lu2016-04-088-319/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the new SSE2/AVX2 memsets are faster than the previous ones, we can remove the previous SSE2/AVX2 memsets and replace them with the new ones. This reduces the size of libc.so by about 900 bytes. No change in IFUNC selection if SSE2 and AVX2 memsets weren't used before. If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset optimized with Enhanced REP STOSB will be used for processors with ERMS. The new AVX512 memset will be used for processors with AVX512 which prefer vzeroupper. [BZ #19881] * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded into ... * sysdeps/x86_64/memset.S: This. (__bzero): Removed. (__memset_tail): Likewise. (__memset_chk): Likewise. (memset): Likewise. (MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't defined. (MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined. * sysdeps/x86_64/multiarch/memset-avx2.S: Removed. (__memset_zero_constant_len_parameter): Check SHARED instead of PIC. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memset-avx2 and memset-sse2-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Remove __memset_chk_sse2, __memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__bzero): Enabled. * sysdeps/x86_64/multiarch/memset.S (memset): Replace __memset_sse2 and __memset_avx2 with __memset_sse2_unaligned and __memset_avx2_unaligned. Use __memset_sse2_unaligned_erms or __memset_avx2_unaligned_erms if processor has ERMS. Support __memset_avx512_unaligned_erms and __memset_avx512_unaligned. (memset): Removed. (__memset_chk): Likewise. (MEMSET_SYMBOL): New. (libc_hidden_builtin_def): Replace __memset_sse2 with __memset_sse2_unaligned. * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace __memset_chk_sse2 and __memset_chk_avx2 with __memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms. Use __memset_chk_sse2_unaligned_erms or __memset_chk_avx2_unaligned_erms if processor has ERMS. Support __memset_chk_avx512_unaligned_erms and __memset_chk_avx512_unaligned.
* X86-64: Use non-temporal store in memcpy on large dataH.J. Lu2016-04-085-171/+234
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The large memcpy micro benchmark in glibc shows that there is a regression with large data on Haswell machine. non-temporal store in memcpy on large data can improve performance significantly. This patch adds a threshold to use non temporal store which is 6 times of shared cache size. When size is above the threshold, non temporal store will be used, but avoid non-temporal store if there is overlap between destination and source since destination may be in cache when source is loaded. For size below 8 vector register width, we load all data into registers and store them together. Only forward and backward loops, which move 4 vector registers at a time, are used to support overlapping addresses. For forward loop, we load the last 4 vector register width of data and the first vector register width of data into vector registers before the loop and store them after the loop. For backward loop, we load the first 4 vector register width of data and the last vector register width of data into vector registers before the loop and store them after the loop. [BZ #19928] * sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold): New. (init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6 times of shared cache size. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S (VMOVNT): New. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S (VMOVNT): Likewise. * sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S (VMOVNT): Likewise. (VMOVU): Changed to movups for smaller code sizes. (VMOVA): Changed to movaps for smaller code sizes. * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update comments. (PREFETCH): New. (PREFETCH_SIZE): Likewise. (PREFETCHED_LOAD_SIZE): Likewise. (PREFETCH_ONE_SET): Likewise. Rewrite to use forward and backward loops, which move 4 vector registers at a time, to support overlapping addresses and use non temporal store if size is above the threshold and there is no overlap between destination and source.
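A compilable sketch of the threshold policy described above, not the glibc source; the shared cache size below is an arbitrary example value. Non-temporal stores are considered only above 6 times the shared cache size and only when source and destination do not overlap.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static size_t shared_cache_size = 8 * 1024 * 1024;   /* Example value.  */

    static bool
    use_non_temporal (const void *dst, const void *src, size_t n)
    {
      size_t threshold = 6 * shared_cache_size;
      uintptr_t d = (uintptr_t) dst, s = (uintptr_t) src;
      bool overlap = d < s + n && s < d + n;
      return n > threshold && !overlap;
    }

    int
    main (void)
    {
      char a[16];
      printf ("%d\n", use_non_temporal (a, a + 8, sizeof a));  /* 0: small and overlapping.  */
      return 0;
    }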
* X86-64: Prepare memmove-vec-unaligned-erms.SH.J. Lu2016-04-061-54/+84
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version the default memcpy, mempcpy and memmove.

* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S (MEMCPY_SYMBOL): New. (MEMPCPY_SYMBOL): Likewise. (MEMMOVE_CHK_SYMBOL): Likewise. Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk symbols. Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on __mempcpy symbols. Provide alias for __memcpy_chk in libc.a. Provide alias for memcpy in libc.a and ld.so.

(cherry picked from commit a7d1c51482d15ab6c07e2ee0ae5e007067b18bfb)
* X86-64: Prepare memset-vec-unaligned-erms.SH.J. Lu2016-04-061-13/+19
Prepare memset-vec-unaligned-erms.S to make the SSE2 version the default memset.

* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (MEMSET_CHK_SYMBOL): New. Define if not defined. (__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH. Disabled for now. Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk symbols. Properly check USE_MULTIARCH on __memset symbols.

(cherry picked from commit 4af1bb06c59d24f35bf8dc55897838d926c05892)
* Force 32-bit displacement in memset-vec-unaligned-erms.SH.J. Lu2016-04-051-0/+13
| | | | | | | * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force 32-bit displacement to avoid long nop between instructions. (cherry picked from commit ec0cac9a1f4094bd0db6f77c1b329e7a40eecc10)
* Add a comment in memset-sse2-unaligned-erms.SH.J. Lu2016-04-051-0/+2
| | | | | | | * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add a comment on VMOVU and VMOVA. (cherry picked from commit 696ac774847b80cf994438739478b0c3003b5958)
* Don't put SSE2/AVX/AVX512 memmove/memset in ld.soH.J. Lu2016-04-056-32/+40
| | | | | | | | | | | | | | | | Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX and AVX512 memmove and memset in ld.so. * sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip if not in libc. * sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. (cherry picked from commit 5cd7af016d8587ff53b20ba259746f97edbddbf7)
* Fix memmove-vec-unaligned-erms.SH.J. Lu2016-04-051-24/+30
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk and __memmove since that breaks __memmove_chk. Don't check source == destination first since it is less common.

* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: (__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk with unaligned_erms. (__memmove_erms): Skip if source == destination. (__memmove_unaligned_erms): Don't check source == destination first.

(cherry picked from commit ea2785e96fa503f3a2b5dd9f3a6ca65622b3c5f2)
* Remove Fast_Copy_Backward from Intel Core processorsH.J. Lu2016-04-021-5/+1
Intel Core i3, i5 and i7 processors have fast unaligned copy, so copy backward is ignored. Remove Fast_Copy_Backward from Intel Core processors to avoid confusion.

* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set bit_arch_Fast_Copy_Backward for Intel Core processors.

(cherry picked from commit 27d3ce1467990f89126e228559dec8f84b96c60e)
* Add x86-64 memset with unaligned store and rep stosbH.J. Lu2016-04-026-1/+335
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 memset with unaligned store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. A single file provides 2 implementations of memset, one with rep stosb and the other without rep stosb. They share the same codes when size is between 2 times of vector register size and REP_STOSB_THRESHOLD which defaults to 2KB. Key features: 1. Use overlapping store to avoid branch. 2. For size <= 4 times of vector register size, fully unroll the loop. 3. For size > 4 times of vector register size, store 4 times of vector register size at a time. [BZ #19881] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and memset-avx512-unaligned-erms. * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned, __memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned, __memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned, __memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned, __memset_sse2_unaligned_erms, __memset_erms, __memset_avx2_unaligned, __memset_avx2_unaligned_erms, __memset_avx512_unaligned_erms and __memset_avx512_unaligned. * sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New file. * sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise. * sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise. (cherry picked from commit 830566307f038387ca0af3fd327706a8d1a2f595)
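The key features above can be modeled in portable C; the sketch below is not the assembly implementation, uses plain memset calls in place of vector and rep stosb stores, and assumes the 16-byte vector size and 2KB REP_STOSB_THRESHOLD mentioned in the text.

    #include <string.h>
    #include <stddef.h>

    #define VEC 16
    #define REP_STOSB_THRESHOLD 2048

    static void
    memset_model (unsigned char *p, int c, size_t n)
    {
      if (n <= 2 * VEC)
        {
          /* Overlapping head/tail stores avoid a branchy byte loop.  */
          if (n >= VEC)
            {
              memset (p, c, VEC);
              memset (p + n - VEC, c, VEC);
            }
          else if (n > 0)
            memset (p, c, n);          /* Real code handles 1..15 with smaller overlaps.  */
          return;
        }
      if (n > REP_STOSB_THRESHOLD)
        {
          memset (p, c, n);            /* ERMS path: one "rep stosb" in the real code.  */
          return;
        }
      if (n <= 4 * VEC)
        {
          /* Fully unrolled: overlapping stores from both ends.  */
          memset (p, c, 2 * VEC);
          memset (p + n - 2 * VEC, c, 2 * VEC);
          return;
        }
      /* Main loop: 4 * VEC at a time, then an overlapping tail store.  */
      size_t i = 0;
      for (; i + 4 * VEC <= n; i += 4 * VEC)
        memset (p + i, c, 4 * VEC);
      memset (p + n - 4 * VEC, c, 4 * VEC);
    }

    int
    main (void)
    {
      unsigned char buf[256];
      memset_model (buf, 0xab, sizeof buf);
      return buf[0] == 0xab && buf[255] == 0xab ? 0 : 1;
    }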
* Add x86-64 memmove with unaligned load/store and rep movsbH.J. Lu2016-04-026-1/+594
Implement x86-64 memmove with unaligned load/store and rep movsb. Support 16-byte, 32-byte and 64-byte vector register sizes. When size <= 8 times of vector register size, there is no check for address overlap between source and destination. Since overhead for overlap check is small when size > 8 times of vector register size, memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb and the other without rep movsb. They share the same code when size is between 2 times of vector register size and REP_MOVSB_THRESHOLD, which is 2KB for 16-byte vector register size and scaled up for larger vector register sizes.

Key features:
1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load all sources into registers and store them together.
3. If there is no address overlap between source and destination, copy from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy. Avoid slow backward rep movsb by falling back to backward copy 8 times of vector register size at a time.
7. Skip when address of destination == address of source.

[BZ #19776]
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and memmove-avx512-unaligned-erms.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): Test __memmove_chk_avx512_unaligned_2, __memmove_chk_avx512_unaligned_erms, __memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms, __memmove_chk_sse2_unaligned_2, __memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2, __memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2, __memmove_avx512_unaligned_erms, __memmove_erms, __memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms, __memcpy_chk_avx512_unaligned_2, __memcpy_chk_avx512_unaligned_erms, __memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms, __memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms, __memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms, __memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms, __memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms, __memcpy_erms, __mempcpy_chk_avx512_unaligned_2, __mempcpy_chk_avx512_unaligned_erms, __mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms, __mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms, __mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms, __mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms, __mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and __mempcpy_erms.
* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New file.
* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Likewise.

(cherry picked from commit 88b57b8ed41d5ecf2e1bdfc19556f9246a665ebb)
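A hedged C model of the dispatch described in the key features above; the real code is vectorized assembly, so the byte loops and temporary buffer here only stand in for the 4-vector and 8-vector moves.

    #include <string.h>
    #include <stddef.h>
    #include <stdint.h>

    #define VEC 16
    #define REP_MOVSB_THRESHOLD 2048

    static void
    memmove_model (unsigned char *dst, const unsigned char *src, size_t n)
    {
      if (dst == src || n == 0)
        return;                            /* Feature 7: nothing to do.  */
      if (n <= 8 * VEC)
        {
          /* Feature 2: load everything, then store; no overlap check.  */
          unsigned char tmp[8 * VEC];
          memcpy (tmp, src, n);
          memcpy (dst, tmp, n);
          return;
        }
      if ((uintptr_t) dst < (uintptr_t) src
          || (uintptr_t) dst >= (uintptr_t) src + n)
        {
          /* Forward copy; rep movsb is used only in this direction and only
             above REP_MOVSB_THRESHOLD when ERMS is available.  */
          for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
          return;
        }
      /* Destination above source with overlap: copy backward (the real code
         moves 8 * VEC per iteration to avoid slow backward rep movsb).  */
      for (size_t i = n; i > 0; i--)
        dst[i - 1] = src[i - 1];
    }

    int
    main (void)
    {
      unsigned char buf[64] = "overlapping move example";
      memmove_model (buf + 4, buf, 24);
      return buf[4] == 'o' ? 0 : 1;
    }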
* Initial Enhanced REP MOVSB/STOSB (ERMS) supportH.J. Lu2016-04-021-0/+4
| | | | | | | | | | | | The newer Intel processors support Enhanced REP MOVSB/STOSB (ERMS) which has a feature bit in CPUID. This patch adds the Enhanced REP MOVSB/STOSB (ERMS) bit to x86 cpu-features. * sysdeps/x86/cpu-features.h (bit_cpu_ERMS): New. (index_cpu_ERMS): Likewise. (reg_ERMS): Likewise. (cherry picked from commit 0791f91dff9a77263fa8173b143d854cad902c6d)
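A small standalone check corresponding to the new feature bit; this is a hedged sketch using GCC's <cpuid.h>, not the glibc cpu-features code. ERMS is reported in EBX bit 9 of CPUID leaf 7, subleaf 0.

    #include <cpuid.h>
    #include <stdio.h>

    int
    main (void)
    {
      unsigned int eax, ebx, ecx, edx;
      int erms = __get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx)
                 && (ebx & (1u << 9));
      printf ("ERMS %ssupported\n", erms ? "" : "not ");
      return 0;
    }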
* Make __memcpy_avx512_no_vzeroupper an aliasH.J. Lu2016-04-023-430/+404
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make __memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper to reduce code size of libc.so. * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove memcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed to ... * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This. (MEMCPY): Don't define. (MEMCPY_CHK): Likewise. (MEMPCPY): Likewise. (MEMPCPY_CHK): Likewise. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMPCPY_CHK): Renamed to ... (__mempcpy_chk_avx512_no_vzeroupper): This. (MEMCPY_CHK): Renamed to ... (__memmove_chk_avx512_no_vzeroupper): This. (MEMCPY): Renamed to ... (__memmove_avx512_no_vzeroupper): This. (__memcpy_avx512_no_vzeroupper): New alias. (__memcpy_chk_avx512_no_vzeroupper): Likewise. (cherry picked from commit 064f01b10b57ff09cda7025f484b848c38ddd57a)
* Implement x86-64 multiarch mempcpy in memcpyH.J. Lu2016-04-029-57/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | Implement x86-64 multiarch mempcpy in memcpy to share most of code. It reduces code size of libc.so. [BZ #18858] * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned and mempcpy-avx512-no-vzeroupper. * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New. (MEMPCPY): Likewise. * sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed. * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise. * sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise. (cherry picked from commit c365e615f7429aee302f8af7bf07ae262278febb)
* [x86] Add a feature bit: Fast_Unaligned_CopyH.J. Lu2016-04-023-2/+17
On AMD processors, memcpy optimized with unaligned SSE load is slower than memcpy optimized with aligned SSSE3, while other string functions are faster with unaligned SSE load. A feature bit, Fast_Unaligned_Copy, is added to select memcpy optimized with unaligned SSE load.

[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel processors. Set Fast_Copy_Backward for AMD Excavator processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy): New. (index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check Fast_Unaligned_Copy instead of Fast_Unaligned_Load.

(cherry picked from commit e41b395523040fcb58c7d378475720c2836d280c)
* tst-audit10: Fix compilation on compilers without bit_AVX512F [BZ #19860]Florian Weimer2016-04-021-1/+4
[BZ #19860]
* sysdeps/x86_64/tst-audit10.c (avx512_enabled): Always return zero if the compiler does not provide the AVX512F bit.

(cherry picked from commit f327f5b47be57bc05a4077344b381016c1bb2c11)
* Don't set %rcx twice before "rep movsb"H.J. Lu2016-04-021-1/+0
| | | | | | | * sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY): Don't set %rcx twice before "rep movsb". (cherry picked from commit 3c9a4cd16cbc7b79094fec68add2df66061ab5d7)
* Set index_arch_AVX_Fast_Unaligned_Load only for Intel processorsH.J. Lu2016-04-022-74/+88
Since only Intel processors with AVX2 have fast unaligned load, we should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors. Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces and call get_common_indeces for other processors. Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to avoid loading GLRO(dl_x86_cpu_features) in cpu-features.c.

[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove inline. Check family before setting family, model and extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable bits here. (init_cpu_features): Replace HAS_CPU_FEATURE and HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load for Intel processors with usable AVX2. Call get_common_indeces for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro. (CPU_FEATURES_ARCH_P): Likewise. (HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P. (HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.

(cherry picked from commit f781a9e96138d8839663af5e88649ab1fbed74f8)
* Add _arch_/_cpu_ to index_*/bit_* in x86 cpu-features.hH.J. Lu2016-04-023-151/+159
index_* and bit_* macros are used to access the cpuid and feature arrays of struct cpu_features. It is very easy to use bits and indices of the cpuid array on the feature array, especially in assembly code. For example, sysdeps/i386/i686/multiarch/bcopy.S has

HAS_CPU_FEATURE (Fast_Rep_String)

which should be

HAS_ARCH_FEATURE (Fast_Rep_String)

We change index_* and bit_* to index_cpu_*/index_arch_* and bit_cpu_*/bit_arch_* so that we can catch such errors at build time.

[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ... (bit_arch_*): This for feature array. (bit_*): Renamed to ... (bit_cpu_*): This for cpu array. (index_*): Renamed to ... (index_arch_*): This for feature array. (index_*): Renamed to ... (index_cpu_*): This for cpu array. [__ASSEMBLER__] (HAS_FEATURE): Add and use field. [__ASSEMBLER__] (HAS_CPU_FEATURE): Pass cpu to HAS_FEATURE. [__ASSEMBLER__] (HAS_ARCH_FEATURE): Pass arch to HAS_FEATURE. [!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and bit_##name with index_cpu_##name and bit_cpu_##name. [!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and bit_##name with index_arch_##name and bit_arch_##name.

(cherry picked from commit 6aa3e97e2530f9917f504eb4146af119a3f27229)
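A toy model of the build-time effect described above; all names and bit values here are hypothetical. The point is only that token-pasting index_cpu_##name/index_arch_##name makes a mixed-up feature name fail to compile instead of silently indexing the wrong array.

    /* Hypothetical feature words standing in for struct cpu_features.  */
    static unsigned int cpuid_words[1];
    static unsigned int feature_words[1];

    #define index_cpu_SSE2             0
    #define bit_cpu_SSE2               (1u << 26)
    #define index_arch_Fast_Rep_String 0
    #define bit_arch_Fast_Rep_String   (1u << 0)

    #define HAS_CPU_FEATURE(name) \
      ((cpuid_words[index_cpu_##name] & bit_cpu_##name) != 0)
    #define HAS_ARCH_FEATURE(name) \
      ((feature_words[index_arch_##name] & bit_arch_##name) != 0)

    int
    main (void)
    {
      /* HAS_CPU_FEATURE (Fast_Rep_String) would not compile here, because
         index_cpu_Fast_Rep_String is undefined -- exactly the mistake the
         rename is meant to catch at build time.  */
      return HAS_CPU_FEATURE (SSE2) + HAS_ARCH_FEATURE (Fast_Rep_String);
    }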
* Fix tst-audit10 build when -mavx512f is not supported.Roland McGrath2016-04-022-3/+4
| | | | (cherry picked from commit 3bd80c0de2f8e7ca8020d37739339636d169957e)
* tst-audit4, tst-audit10: Compile AVX/AVX-512 code separately [BZ #19269]Florian Weimer2016-04-025-55/+112
| | | | | | | This ensures that GCC will not use unsupported instructions before the run-time check to ensure support. (cherry picked from commit 3c0f7407eedb524c9114bb675cd55b903c71daaa)
* Group AVX512 functions in .text.avx512 sectionH.J. Lu2016-04-022-2/+2
| | | | | | | | | * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Replace .text with .text.avx512. * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Likewise. (cherry picked from commit fee9eb6200f0e44a4b684903bc47fde36d46f1a5)
* x86-64: Fix memcpy IFUNC selectionH.J. Lu2016-04-021-13/+14
Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back. The existing selection order is updated to the following selection order:

1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_ssse3

[BZ #18880]
* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load, instead of Slow_BSF, and also check for Fast_Copy_Backward to enable __memcpy_ssse3_back.

(cherry picked from commit 14a1d7cc4c4fd5ee8e4e66b777221dd32a84efe8)
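The five-step order above, rendered as a hedged C model; the actual selector is the assembly in sysdeps/x86_64/multiarch/memcpy.S, and the struct below merely names the feature bits involved.

    #include <stdio.h>

    struct cpu
    {
      int avx_fast_unaligned_load, fast_unaligned_load, ssse3, fast_copy_backward;
    };

    static const char *
    select_memcpy (const struct cpu *c)
    {
      if (c->avx_fast_unaligned_load) return "__memcpy_avx_unaligned";
      if (c->fast_unaligned_load)     return "__memcpy_sse2_unaligned";
      if (!c->ssse3)                  return "__memcpy_sse2";
      if (c->fast_copy_backward)      return "__memcpy_ssse3_back";
      return "__memcpy_ssse3";
    }

    int
    main (void)
    {
      struct cpu c = { 0, 1, 1, 1 };
      puts (select_memcpy (&c));   /* __memcpy_sse2_unaligned */
      return 0;
    }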
* S390: Extend structs La_s390_regs / La_s390_retval with vector-registers.Stefan Liebler2016-04-014-65/+136
Starting with z13, vector registers can also occur as argument registers. Thus the passed input/output register structs for the la_s390_[32|64]_gnu_plt[enter|exit] functions should reflect those new registers. This patch extends the structs La_s390_regs and La_s390_retval and adjusts _dl_runtime_profile() to handle those fields in case of running on a z13 machine.

(picked from upstream commit 5cdd1989d1d2f135d02e66250f37ba8e767f9772)

ChangeLog:
* sysdeps/s390/bits/link.h (La_s390_vr): New typedef. (La_s390_32_regs): Append vector register lr_v24-lr_v31. (La_s390_64_regs): Likewise. (La_s390_32_retval): Append vector register lrv_v24. (La_s390_64_retval): Likewise.
* sysdeps/s390/s390-32/dl-trampoline.h (_dl_runtime_profile): Handle extended structs La_s390_32_regs and La_s390_32_retval.
* sysdeps/s390/s390-64/dl-trampoline.h (_dl_runtime_profile): Handle extended structs La_s390_64_regs and La_s390_64_retval.
* S390: Save and restore fprs/vrs while resolving symbols.Stefan Liebler2016-04-017-248/+516
On s390, no fprs/vrs were saved while resolving a symbol via _dl_runtime_resolve/_dl_runtime_profile.

According to the ABI, the fpr arguments are defined as call clobbered. In leaf functions, gcc 4.9 and newer can use fprs for saving/restoring gprs instead of saving them to the stack. If gcc does this in one of the resolver functions, then the floating point arguments of a library function are invalid for the first library function call. Thus, this patch saves/restores the fprs around the resolving code.

The same could occur for vector registers. Furthermore, an ifunc resolver could also clobber the vector/floating point argument registers. Thus this patch provides the further variants _dl_runtime_resolve_vx/_dl_runtime_profile_vx, which are used if the kernel claims that we run on a machine with vector registers.

Furthermore, if _dl_runtime_profile calls _dl_call_pltexit, the pointers to the inregs/outregs structs were set up incorrectly. Now they point to the correct location in the stack frame. Before branching back to the caller, the return values are now restored instead of containing the return values of the _dl_call_pltexit() call. On s390-32, an endless loop occurs if _dl_call_pltexit() should be called. Now, this code path branches to this function instead of just after the preceding basr instruction.

(Picked from upstream commits 4603c51ef7989d7eb800cdd6f42aab206f891077 and d8a012c5c9e4bfc1b8db2bc6deacb85b44a2e1eb)

ChangeLog:
* sysdeps/s390/s390-32/dl-trampoline.S: Include dl-trampoline.h twice to create a non-vector/vector version for _dl_runtime_resolve and _dl_runtime_profile. Move implementation to ...
* sysdeps/s390/s390-32/dl-trampoline.h: ... here. (_dl_runtime_resolve): Save and restore fpr/vrs. (_dl_runtime_profile): Save and restore vrs and fix some issues if _dl_call_pltexit is called.
* sysdeps/s390/s390-32/dl-machine.h (elf_machine_runtime_setup): Choose the correct resolver function if running on a machine with vx.
* sysdeps/s390/s390-64/dl-trampoline.S: Include dl-trampoline.h twice to create a non-vector/vector version for _dl_runtime_resolve and _dl_runtime_profile. Move implementation to ...
* sysdeps/s390/s390-64/dl-trampoline.h: ... here. (_dl_runtime_resolve): Save and restore fpr/vrs. (_dl_runtime_profile): Save and restore vrs and fix some issues if _dl_call_pltexit is called.
* sysdeps/s390/s390-64/dl-machine.h (elf_machine_runtime_setup): Choose the correct resolver function if running on a machine with vx.
* resolv: Always set *resplen2 out parameter in send_dg [BZ #19791]Florian Weimer2016-03-283-23/+50
Since commit 44d20bca52ace85850012b0ead37b360e3ecd96e (Implement second fallback mode for DNS requests), there is a code path which returns early, before *resplen2 is initialized. This happens if the name server address is immediately recognized as invalid (because of lack of protocol support, or if it is a broadcast address such as 255.255.255.255, or another invalid address).

If this happens and *resplen2 was non-zero (which is the case if a previous query resulted in a failure), __libc_res_nquery would reuse an existing second answer buffer. This answer has been previously identified as unusable (for example, it could be an NXDOMAIN response). Due to the presence of a second answer, no name server switching will occur. The result is a name resolution failure, although a successful resolution would have been possible if name servers had been switched and queries had proceeded along the search path.

The above paragraph still simplifies the situation. Before glibc 2.23, if the second answer needed malloc, the stub resolver would still attempt to reuse the second answer, but this is not possible because __libc_res_nsearch has freed it, after the unsuccessful call to __libc_res_nquerydomain, and set the buffer pointer to NULL. This eventually leads to an assertion failure in __libc_res_nquery:

/* Make sure both hp and hp2 are defined */
assert((hp != NULL) && (hp2 != NULL));

If assertions are disabled, the consequence is a NULL pointer dereference on the next line.

Starting with glibc 2.23, as a result of commit e9db92d3acfe1822d56d11abcea5bfc4c41cf6ca (CVE-2015-7547: getaddrinfo() stack-based buffer overflow (Bug 18665)), the second answer is always allocated with malloc. This means that the assertion failure happens with small responses as well because there is no buffer to reuse, as soon as there is a name resolution failure which triggers a search for an answer along the search path.

This commit addresses the issue by ensuring that *resplen2 is initialized before the send_dg function returns. This commit also addresses a bug where an invalid second reply is incorrectly returned as valid to the caller.

(cherry picked from commit b66d837bb5398795c6b0f651bd5a5d66091d8577)
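An illustrative sketch of the fix pattern, not the actual send_dg code (send_query and its arguments are hypothetical): the *resplen2 out parameter is written before any early return, so the caller can never see a stale length from a previous query.

    #include <stddef.h>
    #include <stdio.h>

    static int
    send_query (const void *ns_addr, int *resplen2)
    {
      if (resplen2 != NULL)
        *resplen2 = 0;            /* Initialize before any early-return path.  */

      if (ns_addr == NULL)        /* Stands in for "invalid name server address".  */
        return -1;

      /* Real code would send the query and fill in *resplen2 on success.  */
      return 0;
    }

    int
    main (void)
    {
      int resplen2 = 123;                    /* Stale value from an earlier query.  */
      send_query (NULL, &resplen2);
      printf ("resplen2 = %d\n", resplen2);  /* 0, not 123.  */
      return 0;
    }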
* math: don't clobber old libm.so on install [BZ #19822]Dylan Alex Simon2016-03-213-1/+9
| | | | | | | | | | | | | | When installing glibc (w/mathvec enabled) in-place on a system with a glibc w/out mathvec enabled, the install will clobber the existing libm.so (e.g., /lib64/libm-2.21.so) with a linker script. This is because libm.so is a symlink to libm.so.6 which is a symlink to the final libm-2.21.so file. When the makefile writes the linker script directly to libm.so, it gets clobbered. The simple patch below to math/Makefile fixes this. It is based on the nptl Makefile, which does exactly the same thing in a safer way. (cherry picked from commit f9378ac3773ffe998a2b3406568778ee9f77f759)
* Fix resource leak in resolver (bug 19257)Andreas Schwab2016-03-202-1/+7
| | | | | | | | The number of currently defined nameservers is stored in ->nscount, whereas ->_u._ext.nscount is set by __libc_res_nsend only after local initializations. (cherry picked from commit 5e7fdabd7df1fc6c56d104e61390bf5a6b526c38)
* Or bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARSH.J. Lu2016-03-113-1/+8
| | | | | | | | | We should turn on bit_Prefer_MAP_32BIT_EXEC in EXTRA_LD_ENVVARS without overriding other bits. [BZ #19758] * sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h (EXTRA_LD_ENVVARS): Or bit_Prefer_MAP_32BIT_EXEC.
* Define _HAVE_STRING_ARCH_mempcpy to 1 for x86H.J. Lu2016-03-113-0/+9
| | | | | | | | | | Since x86 has an optimized mempcpy and GCC can inline mempcpy on x86, define _HAVE_STRING_ARCH_mempcpy to 1 for x86. [BZ #19759] * sysdeps/x86/bits/string.h (_HAVE_STRING_ARCH_mempcpy): New. (cherry picked from commit 2b35e48c0c547b3f6f81996ce7ad7d67e24c7329)
* Mention BZ #19762 in NEWSH.J. Lu2016-03-101-0/+1
* Use HAS_ARCH_FEATURE with Fast_Rep_StringH.J. Lu2016-03-1010-9/+27
| | | | | | | | | | | | | | | | | | | | | | | HAS_ARCH_FEATURE, not HAS_CPU_FEATURE, should be used with Fast_Rep_String. [BZ #19762] * sysdeps/i386/i686/multiarch/bcopy.S (bcopy): Use HAS_ARCH_FEATURE with Fast_Rep_String. * sysdeps/i386/i686/multiarch/bzero.S (__bzero): Likewise. * sysdeps/i386/i686/multiarch/memcpy.S (memcpy): Likewise. * sysdeps/i386/i686/multiarch/memcpy_chk.S (__memcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memmove_chk.S (__memmove_chk): Likewise. * sysdeps/i386/i686/multiarch/mempcpy.S (__mempcpy): Likewise. * sysdeps/i386/i686/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise. * sysdeps/i386/i686/multiarch/memset.S (memset): Likewise. * sysdeps/i386/i686/multiarch/memset_chk.S (__memset_chk): Likewise. (cherry picked from commit 4e940b2f4b577f3a530e0580373f7c2d569f4d63)
* mips: terminate the FDE before the return trampoline in makecontextAurelien Jarno2016-03-093-0/+14
| | | | | | | | | | | | | | | | | | | In makecontext the FDE needs to be terminated before the return trampoline otherwise backtrace called within a context created by makecontext yields infinite backtrace. This bug has been present for a long time, stdlib/tst-makecontext did not fail until recent commit e535ce25. Tested on mips-linux-gnu and mips64el-linux-gnuabi64 and mips-linux-gnu, no regression. This fixes stdlib/tst-makecontext on MIPS. Changelog: [BZ #19792] * sysdeps/unix/sysv/linux/mips/makecontext.S (__makecontext): Terminate FDE before return label. (cherry picked from commit f8e9c4d30c28b8815e65a391416e8b15d2e7cbb8)
* Add sys/auxv.h wrapper to include/sys/Aurelien Jarno2016-03-082-0/+5
The GNU libc testsuite fails to build on powerpc/ppc64/ppc64le with the following error:

../sysdeps/powerpc/test-get_hwcap.c:26:22: fatal error: sys/auxv.h: No such file or directory

This is because test-get_hwcap.c includes <sys/auxv.h>, but we don't provide a wrapper in include/sys. This patch adds one.

Changelog:
* include/sys/auxv.h: New file.

(cherry picked from commit 0b8dedd38f304d796b6b9b349428bea7f1f7065f)
* sln: use stat64Hongjiu Zhang2016-03-072-2/+7
| | | | | | | | | | | | | | When using sln on some filesystems which return 64-bit inodes, the stat call might fail during install like so: .../elf/sln .../elf/symlink.list /lib32/libc.so.6: invalid destination: Value too large for defined data type /lib32/ld-linux.so.2: invalid destination: Value too large for defined data type Makefile:104: recipe for target 'install-symbolic-link' failed Switch to using stat64 all the time to avoid this. URL: https://bugs.gentoo.org/576396 (cherry picked from commit f5e753c8c3a18a1e3c715dd11bf4dc341b5c481f)
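For an application outside glibc, the equivalent of the stat64 switch is to build with the LFS interface; a hedged sketch follows (sln itself, living inside glibc, simply calls stat64 directly).

    /* Request the 64-bit file interface so large inode numbers do not make
       stat fail with EOVERFLOW (equivalent in effect to calling stat64).  */
    #define _FILE_OFFSET_BITS 64
    #include <sys/stat.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
      if (argc < 2)
        {
          fprintf (stderr, "usage: %s FILE\n", argv[0]);
          return 1;
        }
      struct stat st;
      if (stat (argv[1], &st) != 0)
        {
          perror ("stat");
          return 1;
        }
      printf ("inode: %llu\n", (unsigned long long) st.st_ino);
      return 0;
    }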
* Update NEWS.Carlos O'Donell2016-02-261-0/+11
* NEWS (2.23): Fix typo in bug 19048 text.Carlos O'Donell2016-02-252-1/+5
* Don't use long double math functions if NO_LONG_DOUBLEAndreas Schwab2016-02-252-1/+11