mirror/glibc - mirror of git://sourceware.org/git/glibc.git

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]	Noah Goldstein	2021-06-23	2	-37/+98
\| \| \| \| \| \| \| \| \| \| \| \| \|	This commit fixes the bug mentioned in the previous commit. The previous implementations of wmemchr in these files relied on n * sizeof(wchar_t) which was not guranteed by the standard. The new overflow tests added in the previous commit now pass (As well as all the other tests). Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86-64: Add wcslen optimize for sse4.1	Noah Goldstein	2021-06-23	6	-36/+63
\| \| \| \| \| \| \| \| \|	No bug. This comment adds the ifunc / build infrastructure necessary for wcslen to prefer the sse4.1 implementation in strlen-vec.S. test-wcslen.c is passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86-64: Move strlen.S to multiarch/strlen-vec.S	H.J. Lu	2021-06-23	4	-242/+262
\| \| \| \| \| \| \| \|	Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1 version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants. This also removes the unused symbols, __GI___strlen_sse2 and __GI___wcsnlen_sse4_1.
*	Properly check stack alignment [BZ #27901]	H.J. Lu	2021-05-24	1	-46/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Replace if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0) which may be optimized out by compiler, with int __attribute__ ((weak, noclone, noinline)) is_aligned (void *p, int align) { return (((uintptr_t) p) & (align - 1)) != 0; } 2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN. 3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment for both i386 and x86-64. 4. Update powerpc to use TEST_STACK_ALIGN_INIT. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
*	x86: Improve memmove-vec-unaligned-erms.S	Noah Goldstein	2021-05-23	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes the condition for copy 4x VEC so that if length is exactly equal to 4 * VEC_SIZE it will use the 4x VEC case instead of 8x VEC case. Results For Skylake memcpy-avx2-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 9.137 , 6.873 , New , 75.22 128 , 7 , 0 , 12.933 , 7.732 , New , 59.79 128 , 0 , 7 , 11.852 , 6.76 , New , 57.04 128 , 7 , 7 , 12.587 , 6.808 , New , 54.09 Results For Icelake memcpy-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 9.963 , 5.416 , New , 54.36 128 , 7 , 0 , 16.467 , 8.061 , New , 48.95 128 , 0 , 7 , 14.388 , 7.644 , New , 53.13 128 , 7 , 7 , 14.546 , 7.642 , New , 52.54 Results For Tigerlake memcpy-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 8.979 , 4.95 , New , 55.13 128 , 7 , 0 , 14.245 , 7.122 , New , 50.0 128 , 0 , 7 , 12.668 , 6.675 , New , 52.69 128 , 7 , 7 , 13.042 , 6.802 , New , 52.15 Results For Skylake memmove-avx2-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 6.181 , 5.691 , New , 92.07 128 , 32 , 0 , 6.165 , 5.752 , New , 93.3 128 , 0 , 7 , 13.923 , 9.37 , New , 67.3 128 , 7 , 0 , 12.049 , 10.182 , New , 84.5 Results For Icelake memmove-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 5.479 , 4.889 , New , 89.23 128 , 32 , 0 , 5.127 , 4.911 , New , 95.79 128 , 0 , 7 , 18.885 , 13.547 , New , 71.73 128 , 7 , 0 , 15.565 , 14.436 , New , 92.75 Results For Tigerlake memmove-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 5.275 , 4.815 , New , 91.28 128 , 32 , 0 , 5.376 , 4.565 , New , 84.91 128 , 0 , 7 , 19.426 , 14.273 , New , 73.47 128 , 7 , 0 , 15.924 , 14.951 , New , 93.89 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86: Improve memset-vec-unaligned-erms.S	Noah Goldstein	2021-05-20	1	-22/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No bug. This commit makes a few small improvements to memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 instead of 128. Either alignment will perform equally well in a loop and 128 just increases the odds of having to do an extra iteration which can be significant overhead for small values. 2) Align some targets and the loop. 3) Remove an ALU from the alignment process. 4) Reorder the last 4x VEC so that they are stored after the loop. 5) Move the condition for leq 8x VEC to before the alignment process. test-memset and test-wmemset are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86: Optimize memcmp-evex-movbe.S	Noah Goldstein	2021-05-18	1	-302/+408
\| \| \| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes memcmp-evex.S. The optimizations include adding a new vec compare path for small sizes, reorganizing the entry control flow, removing some unnecissary ALU instructions from the main loop, and most importantly replacing the heavy use of vpcmp + kand logic with vpxor + vptern. test-memcmp and test-wmemcmp are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86: Optimize memcmp-avx2-movbe.S	Noah Goldstein	2021-05-18	3	-281/+402
\| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes memcmp-avx2.S. The optimizations include adding a new vec compare path for small sizes, reorganizing the entry control flow, and removing some unnecissary ALU instructions from the main loop. test-memcmp and test-wmemcmp are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	elf: Use relaxed atomics for racy accesses [BZ #19329]	Szabolcs Nagy	2021-05-11	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a follow up patch to the fix for bug 19329. This adds relaxed MO atomics to accesses that were previously data races but are now race conditions, and where relaxed MO is sufficient. The race conditions all follow the pattern that the write is behind the dlopen lock, but a read can happen concurrently (e.g. during tls access) without holding the lock. For slotinfo entries the read value only matters if it reads from a synchronized write in dlopen or dlclose, otherwise the related dtv entry is not valid to access so it is fine to leave it in an inconsistent state. The same applies for GL(dl_tls_max_dtv_idx) and GL(dl_tls_generation), but there the algorithm relies on the fact that the read of the last synchronized write is an increasing value. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
*	x86: Add EVEX optimized memchr family not safe for RTM	Noah Goldstein	2021-05-08	10	-41/+217
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	No bug. This commit adds a new implementation for EVEX memchr that is not safe for RTM because it uses vzeroupper. The benefit is that by using ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is faster than the RTM safe version which cannot use vpcmpeq because there is no EVEX encoding for the instruction. All parts of the implementation aside from the 4x loop are the same for the two versions and the optimization is only relevant for large sizes. Tigerlake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 9.2 , 9.04 , no-RTM , 0.16 512 , 7 , 224 , 9.19 , 8.98 , no-RTM , 0.21 2048 , 0 , 256 , 10.74 , 10.54 , no-RTM , 0.2 2048 , 0 , 512 , 14.81 , 14.87 , RTM , 0.06 2048 , 0 , 1024 , 22.97 , 22.57 , no-RTM , 0.4 2048 , 0 , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <-- Icelake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 7.6 , 7.3 , no-RTM , 0.3 512 , 7 , 224 , 7.63 , 7.27 , no-RTM , 0.36 2048 , 0 , 256 , 8.48 , 8.38 , no-RTM , 0.1 2048 , 0 , 512 , 11.57 , 11.42 , no-RTM , 0.15 2048 , 0 , 1024 , 17.92 , 17.38 , no-RTM , 0.54 2048 , 0 , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <-- test-memchr, test-wmemchr, and test-rawmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86-64: Fix an unknown vector operation in memchr-evex.S	Alice Xu	2021-05-07	1	-1/+1
\| \| \| \| \| \| \|	An unknown vector operation occurred in commit 2a76821c308. Fixed it by using "ymm{k1}{z}" but not "ymm {k1} {z}". Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	Remove architecture specific sched_cpucount optimizations	Adhemerval Zanella	2021-05-07	2	-61/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	And replace the generic algorithm with the Brian Kernighan's one. GCC optimize it with popcnt if the architecture supports, so there is no need to add the extra POPCNT define to enable it. This is really a micro-optimization that only adds complexity: recent ABIs already support it (x86-64-v2 or power64le) and it simplifies the code for internal usage, since i686 does not allow an internal iFUNC call. Checked on x86_64-linux-gnu, aarch64-linux-gnu, and powerpc64le-linux-gnu.
*	x86: Optimize memchr-evex.S	Noah Goldstein	2021-05-03	1	-225/+322
\| \| \| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes memchr-evex.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, saving some ALU in the alignment process, and most importantly increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	x86: Optimize memchr-avx2.S	Noah Goldstein	2021-05-03	1	-178/+247
\| \| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes memchr-avx2.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, asaving a few instructions the in loop return loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
*	regenerate ulps on x86_64 with -march=native	Paul Zimmermann	2021-04-28	1	-3/+3
\| \| \| \| \| \| \|	On x86_64, when configuring glibc with CFLAGS="-O2 -g -march=native", some tests fail. After this patch, "make check" succeeds. Tested on Intel Core i5-4590 with gcc 10.2.1.
*	x86: Optimize strchr-evex.S	Noah Goldstein	2021-04-25	1	-174/+218
\| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes strchr-evex.S. The optimizations are mostly small things such as save an ALU in the alignment process, saving a few instructions in the loop return. The one significant change is saving 2 instructions in the 4x loop. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86: Optimize strchr-avx2.S	Noah Goldstein	2021-04-25	1	-117/+169
\| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes strchr-avx2.S. The optimizations are all small things such as save an ALU in the alignment process, saving a few instructions in the loop return, saving some bytes in the main loop, and increasing the ILP in the return cases. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	nptl: Move pthread_spin_trylock into libc	Florian Weimer	2021-04-23	1	-3/+10
\| \| \| \|	The symbol was moved using scripts/move-symbol-to-libc.py.
*	nptl: Move pthread_spin_lock into libc	Florian Weimer	2021-04-23	1	-2/+8
\| \| \| \|	The symbol was moved using scripts/move-symbol-to-libc.py.
*	nptl: Move pthread_spin_init, Move pthread_spin_unlock into libc	Florian Weimer	2021-04-23	1	-5/+11
\| \| \| \| \| \| \|	For some architectures, the two functions are aliased, so these symbols need to be moved at the same time. The symbols were moved using scripts/move-symbol-to-libc.py.
*	x86: Remove low-level lock optimization	Florian Weimer	2021-04-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current approach is to do this optimizations at a higher level, in generic code, so that single-threaded cases can be specifically targeted. Furthermore, using IS_IN (libc) as a compile-time indicator that all locks are private is no longer correct once process-shared lock implementations are moved into libc. The generic <lowlevellock.h> is not compatible with assembler code (obviously), so it's necessary to remove two long-unused #includes. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
*	elf: Remove lazy tlsdesc relocation related code	Szabolcs Nagy	2021-04-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \|	Remove generic tlsdesc code related to lazy tlsdesc processing since lazy tlsdesc relocation is no longer supported. This includes removing GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at load time when that lock is already held. Added a documentation comment too. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
*	x86: Optimize strlen-avx2.S	Noah Goldstein	2021-04-19	2	-214/+334
\| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes strlen-avx2.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86: Optimize strlen-evex.S	Noah Goldstein	2021-04-19	1	-264/+317
\| \| \| \| \| \| \| \| \| \|	No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S	Noah Goldstein	2021-04-19	5	-27/+74
\| \| \| \| \| \| \| \|	No bug. This commit adds optimized cased for less_vec memset case that uses the avx512vl/avx512bw mask store avoiding the excessive branches. test-memset and test-wmemset are passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86-64: Require BMI2 for strchr-avx2.S	H.J. Lu	2021-04-19	2	-5/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since strchr-avx2.S updated by commit 1f745ecc2109890886b161d4791e1406fdfc29b8 Author: noah <goldstein.w.n@gmail.com> Date: Wed Feb 3 00:38:59 2021 -0500 x86-64: Refactor and improve performance of strchr-avx2.S uses sarx: c4 e2 72 f7 c0 sarx %ecx,%eax,%eax for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and ifunc-avx2.h.
*	x86-64: Require BMI2 for __strlen_evex and __strnlen_evex	H.J. Lu	2021-04-19	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since __strlen_evex and __strnlen_evex added by commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 5 06:24:52 2021 -0800 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX use sarx: c4 e2 6a f7 c0 sarx %edx,%eax,%eax require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c. ifunc-avx2.h already requires BMI2 for EVEX implementation.
*	x86: Update large memcpy case in memmove-vec-unaligned-erms.S	noah	2021-04-16	1	-73/+265
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	No Bug. This commit updates the large memcpy case (no overlap). The update is to perform memcpy on either 2 or 4 contiguous pages at once. This 1) helps to alleviate the affects of false memory aliasing when destination and source have a close 4k alignment and 2) In most cases and for most DRAM units is a modestly more efficient access pattern. These changes are a clear performance improvement for VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy, test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
*	x86_64: Remove lazy tlsdesc relocation related code	Szabolcs Nagy	2021-04-15	4	-219/+2
\| \| \| \| \|	_dl_tlsdesc_resolve_rela and _dl_tlsdesc_resolve_hold are only used for lazy tlsdesc relocation processing which is no longer supported.
*	x86_64: Avoid lazy relocation of tlsdesc [BZ #27137]	Szabolcs Nagy	2021-04-15	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \|	Lazy tlsdesc relocation is racy because the static tls optimization and tlsdesc management operations are done without holding the dlopen lock. This similar to the commit b7cf203b5c17dd6d9878537d41e0c7cc3d270a67 for aarch64, but it fixes a different race: bug 27137. Another issue is that ld auditing ignores DT_BIND_NOW and thus tries to relocate tlsdesc lazily, but that does not work in a BIND_NOW module due to missing DT_TLSDESC_PLT. Unconditionally relocating tlsdesc at load time fixes this bug 27721 too.
*	Improve the accuracy of tgamma (BZ #26983)	Paul Zimmermann	2021-04-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	With this patch, the maximal known error for tgamma is now reduced to 9 ulps for dbl-64, for all rounding modes. Since exhaustive testing is not possible for dbl-64, it might be that there are still cases with an error larger than 9 ulps, but all known cases are fixed (intensive tests were done to find cases with large errors). Tested on x86_64 and powerpc (and by Adhemerval Zanella on aarch64, arm, s390x, sparc, and i686). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
*	Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472]	Paul Zimmermann	2021-04-02	1	-38/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For j0f/j1f/y0f/y1f, the largest error for all binary32 inputs is reduced to at most 9 ulps for all rounding modes. The new code is enabled only when there is a cancellation at the very end of the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not give any visible slowdown on average. Two different algorithms are used: * around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of degree 3 are used, computed using the Sollya tool (https://www.sollya.org/) * for large inputs, an asymptotic formula from [1] is used [1] Fast and Accurate Bessel Function Computation, John Harrison, Proceedings of Arith 19, 2009. Inputs yielding the new largest errors are added to auto-libm-test-in, and ulps are regenerated for various targets (thanks Adhemerval Zanella). Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
*	x86-64: Fix ifdef indentation in strlen-evex.S	Sunil K Pandey	2021-04-01	1	-8/+8
\| \| \| \| \|	Fix some indentations of ifdef in file strlen-evex.S which are off by 1 and confusing to read.
*	x86_64: Correct THREAD_SETMEM/THREAD_SETMEM_NC for movq [BZ #27591]	H.J. Lu	2021-04-01	3	-2/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	config/i386/constraints.md in GCC has (define_constraint "e" "32-bit signed integer constant, or a symbolic reference known to fit that range (for immediate operands in sign-extending x86-64 instructions)." (match_operand 0 "x86_64_immediate_operand")) Since movq takes a signed 32-bit immediate or a register source operand, use "er", instead of "nr"/"ir", constraint for 32-bit signed integer constant or register on movq. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
*	x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions	H.J. Lu	2021-03-29	3	-19/+42
\| \| \| \| \| \|	Update ifunc-memmove.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit.
*	x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions	H.J. Lu	2021-03-29	4	-24/+31
\| \| \| \| \| \| \|	Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
*	x86-64: Add AVX optimized string/memory functions for RTM	H.J. Lu	2021-03-29	52	-248/+670
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX optimized string/memory functions with xtest jz 1f vzeroall ret 1: vzeroupper ret at function exit on processors with usable RTM, but without 256-bit EVEX instructions to avoid VZEROUPPER inside a transactionally executing RTM region.
*	x86-64: Add memcmp family functions with 256-bit EVEX	H.J. Lu	2021-03-29	5	-4/+467
\| \| \| \| \| \| \|	Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function exit.
*	x86-64: Add memset family functions with 256-bit EVEX	H.J. Lu	2021-03-29	6	-14/+90
\| \| \| \| \| \| \|	Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
*	x86-64: Add memmove family functions with 256-bit EVEX	H.J. Lu	2021-03-29	5	-11/+104
\| \| \| \| \| \|	Update ifunc-memmove.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit.
*	x86-64: Add strcpy family functions with 256-bit EVEX	H.J. Lu	2021-03-29	9	-3/+1339
\| \| \| \| \| \|	Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
*	x86-64: Add ifunc-avx2.h functions with 256-bit EVEX	H.J. Lu	2021-03-29	24	-17/+2995
\| \| \| \| \| \| \| \| \| \|	Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW and BMI2 since VZEROUPPER isn't needed at function exit. For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP is set.
*	nptl: Remove MULTI_PAGE_ALIASING [BZ #23554]	H.J. Lu	2021-03-19	1	-1/+0
\| \| \| \| \|	MULTI_PAGE_ALIASING was introduced to mitigate an aliasing issue on Pentium 4. It is no longer needed for processors after Pentium 4.
*	math: Remove mpa files [BZ #15267]	Wilco Dijkstra	2021-03-11	18	-183/+3
\| \| \| \| \| \| \|	Finally remove all mpa related files, headers, declarations, probes, unused tables and update makefiles. Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
*	math: Remove slow paths from asin and acos [BZ #15267]	Wilco Dijkstra	2021-03-11	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	This patch series removes all remaining slow paths and related code. First asin/acos, tan, atan, atan2 implementations are updated, and the final patch removes the unused mpa files, headers and probes. Passes buildmanyglibc. Remove slow paths from asin/acos. Add ULP annotations based on previous slow path checks (which are approximate). Update AArch64 and x86_64 libm-test-ulps. Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
*	Add inputs that generate larger error bounds	Paul Zimmermann	2021-02-27	1	-24/+26
\| \| \| \|	(Using values from https://members.loria.fr/PZimmermann/papers/accuracy.pdf)
*	Reduce the statically linked startup code [BZ #23323]	Florian Weimer	2021-02-25	1	-8/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It turns out the startup code in csu/elf-init.c has a perfect pair of ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A New Method to Bypass 64-bit Linux ASLR"). These functions are not needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY are already processed by the dynamic linker. However, the dynamic linker skipped the main program for some reason. For maximum backwards compatibility, this is not changed, and instead, the main map is consulted from __libc_start_main if the init function argument is a NULL pointer. For statically linked binaries, the old approach based on linker symbols is still used because there is nothing else available. A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because new binaries running on an old libc would not run their ELF constructors, leading to difficult-to-debug issues.
*	x86: Use x86/nptl/pthreaddef.h	H.J. Lu	2021-02-22	1	-47/+0
\| \| \| \| \| \| \|	1. Move sysdeps/i386/nptl/pthreaddef.h to sysdeps/x86/nptl/pthreaddef.h. 2. Remove sysdeps/x86_64/nptl/pthreaddef.h. Reviewed-by: DJ Delorie <dj@redhat.com>
*	x86-64: Refactor and improve performance of strchr-avx2.S	noah	2021-02-08	2	-113/+113
\| \| \| \| \| \| \| \|	No bug. Just seemed the performance could be improved a bit. Observed and expected behavior are unchanged. Optimized body of main loop. Updated page cross logic and optimized accordingly. Made a few minor instruction selection modifications. No regressions in test suite. Both test-strchrnul and test-strchr passed.
*	x86: Adding an upper bound for Enhanced REP MOVSB.	Sajan Karumanchi	2021-02-02	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the process of optimizing memcpy for AMD machines, we have found the vector move operations are outperforming enhanced REP MOVSB for data transfers above the L2 cache size on Zen3 architectures. To handle this use case, we are adding an upper bound parameter on enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'. As per large-bench results, we are configuring this parameter to the L2 cache size for AMD machines and applicable from Zen3 architecture supporting the ERMS feature. For architectures other than AMD, it is the computed value of non-temporal threshold parameter. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>