about summary refs log tree commit diff
path: root/sysdeps/x86_64
Commit message (Collapse)AuthorAgeFilesLines
* x86-64: Add Avoid_Short_Distance_REP_MOVSBH.J. Lu2021-07-281-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | commit 3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5 Author: H.J. Lu <hjl.tools@gmail.com> Date: Sat Jan 25 14:19:40 2020 -0800 x86-64: Avoid rep movsb with short distance [BZ #27130] introduced some regressions on Intel processors without Fast Short REP MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with short distance only on Intel processors with FSRM. bench-memmove-large on Skylake server shows that cycles of __memmove_evex_unaligned_erms improves for the following data size: before after Improvement length=4127, align1=3, align2=0: 479.38 349.25 27% length=4223, align1=9, align2=5: 405.62 333.25 18% length=8223, align1=3, align2=0: 786.12 496.38 37% length=8319, align1=9, align2=5: 727.50 501.38 31% length=16415, align1=3, align2=0: 1436.88 840.00 41% length=16511, align1=9, align2=5: 1375.50 836.38 39% length=32799, align1=3, align2=0: 2890.00 1860.12 36% length=32895, align1=9, align2=5: 2891.38 1931.88 33%
* x86: Install <bits/platform/x86.h> [BZ #27958]H.J. Lu2021-07-231-3/+3
| | | | | | | | | | | | | | | 1. Install <bits/platform/x86.h> for <sys/platform/x86.h> which includes <bits/platform/x86.h>. 2. Rename HAS_CPU_FEATURE to CPU_FEATURE_PRESENT which checks if the processor has the feature. 3. Rename CPU_FEATURE_USABLE to CPU_FEATURE_ACTIVE which checks if the feature is active. There may be other preconditions, like sufficient stack space or further setup for AMX, which must be satisfied before the feature can be used. This fixes BZ #27958. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* mtrace: Wean away from malloc hooksSiddhesh Poyarekar2021-07-221-1/+0
| | | | | | | | | | | | | Wean mtrace away from the malloc hooks and move them into the debug DSO. Split the API away from the implementation so that we can add the API to libc.so as well as libc_malloc_debug.so, with the libc implementations being empty. Update localplt data since memalign no longer has any callers after this change. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
* mcheck: Align struct hdr to MALLOC_ALIGNMENT bytes [BZ #28068]H.J. Lu2021-07-121-4/+0
| | | | | | | | | | | | 1. Align struct hdr to MALLOC_ALIGNMENT bytes so that malloc hooks in libmcheck align memory to MALLOC_ALIGNMENT bytes. 2. Remove tst-mallocalign1 from tests-exclude-mcheck for i386 and x32. 3. Add tst-pvalloc-fortify and tst-reallocarray to tests-exclude-mcheck since they use malloc_usable_size (see BZ #22057). This fixed BZ #28068. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* Add a generic malloc test for MALLOC_ALIGNMENTH.J. Lu2021-07-093-76/+4
| | | | | | | | | 1. Add sysdeps/generic/malloc-size.h to define size related macros for malloc. 2. Move x86_64/tst-mallocalign1.c to malloc and replace ALIGN_MASK with MALLOC_ALIGN_MASK. 3. Add tst-mallocalign1 to tests-exclude-mcheck for i386 and x32 since mcheck doesn't honor MALLOC_ALIGNMENT.
* x86: Remove wcsnlen-sse4_1 from wcslen ifunc-impl-list [BZ #28064]Noah Goldstein2021-07-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | The following commit commit 6f573a27b6c8b4236445810a44660612323f5a73 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:19:34 2021 -0400 x86-64: Add wcslen optimize for sse4.1 Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc implementation list and adding wcslen-sse4.1 to the ifunc implementation list. Testing: test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as well as all other tests in wcsmbs and string. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Test strlen and wcslen with 0 in the RSI register [BZ #28064]H.J. Lu2021-07-083-0/+108
| | | | | | | | | | | | | | commit 6f573a27b6c8b4236445810a44660612323f5a73 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 23 01:19:34 2021 -0400 x86-64: Add wcslen optimize for sse4.1 added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the random value in the the RSI register is larger than the wide-character string length in the existing wcslen test, it didn't trigger the wcslen test failure. Add a test to force 0 into the RSI register before calling wcslen.
* x86_64: Remove unneeded static PIE check for undefined weak diagnosticFangrui Song2021-07-082-58/+0
| | | | | | | | | | | | | | | https://sourceware.org/bugzilla/show_bug.cgi?id=21782 dropped an ld diagnostic for R_X86_64_PC32 referencing an undefined weak symbol in -pie links. Arguably keeping the diagnostic like other ports is more correct, since statically resolving movl foo(%rip), %eax to the link-time zero address produces a corrupted output. It turns out that --enable-static-pie builds do not depend on the ld behavior. GCC generates GOT indirection for weak declarations for -fPIE/-fPIC, so what ld does with the PC-relative relocation doesn't really matter. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86_64: roundeven with sse4.1 supportShen-Ta Hsieh2021-06-277-2/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for the sse4.1 hardware floating point roundeven. Here is some benchmark results on my systems: =AMD Ryzen 9 3900X 12-Core Processor= * benchmark result before this commit | | roundeven | roundevenf | |------------|--------------|--------------| | duration | 3.75587e+09 | 3.75114e+09 | | iterations | 3.93053e+08 | 4.35402e+08 | | max | 52.592 | 58.71 | | min | 7.98 | 7.22 | | mean | 9.55563 | 8.61535 | * benchmark result after this commit | | roundeven | roundevenf | |------------|---------------|--------------| | duration | 3.73815e+09 | 3.73738e+09 | | iterations | 5.82692e+08 | 5.91498e+08 | | max | 56.468 | 51.642 | | min | 6.27 | 6.156 | | mean | 6.41532 | 6.3185 | =Intel(R) Pentium(R) CPU D1508 @ 2.20GHz= * benchmark result before this commit | | roundeven | roundevenf | |------------|--------------|--------------| | duration | 2.18208e+09 | 2.18258e+09 | | iterations | 2.39932e+08 | 2.46924e+08 | | max | 96.378 | 98.035 | | min | 6.776 | 5.94 | | mean | 9.09456 | 8.83907 | * benchmark result after this commit | | roundeven | roundevenf | |------------|--------------|--------------| | duration | 2.17415e+09 | 2.17005e+09 | | iterations | 3.56193e+08 | 4.09824e+08 | | max | 51.693 | 97.192 | | min | 5.926 | 5.093 | | mean | 6.10385 | 5.29507 | Signed-off-by: Shen-Ta Hsieh <ibmibmibm.tw@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Remove unnecessary overflow check from wcsnlen-sse4_1.SNoah Goldstein2021-06-241-7/+0
| | | | | | | | | | | | | | | | | | | | | No bug. The way wcsnlen will check if near the end of maxlen is the following macro: mov %r11, %rsi; \ subq %rax, %rsi; \ andq $-64, %rax; \ testq $-64, %rsi; \ je L(strnlen_ret) Which words independently of s + maxlen overflowing. So the second overflow check is unnecissary for correctness and just extra overhead in the common no overflow case. test-strlen.c, test-wcslen.c, test-strnlen.c and test-wcsnlen.c are all passing Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]Noah Goldstein2021-06-232-38/+107
| | | | | | | | | | | | | This commit fixes the bug mentioned in the previous commit. The previous implementations of wmemchr in these files relied on maxlen * sizeof(wchar_t) which was not guranteed by the standard. The new overflow tests added in the previous commit now pass (As well as all the other tests). Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]Noah Goldstein2021-06-232-37/+98
| | | | | | | | | | | | | This commit fixes the bug mentioned in the previous commit. The previous implementations of wmemchr in these files relied on n * sizeof(wchar_t) which was not guranteed by the standard. The new overflow tests added in the previous commit now pass (As well as all the other tests). Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add wcslen optimize for sse4.1Noah Goldstein2021-06-236-36/+63
| | | | | | | | | No bug. This comment adds the ifunc / build infrastructure necessary for wcslen to prefer the sse4.1 implementation in strlen-vec.S. test-wcslen.c is passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Move strlen.S to multiarch/strlen-vec.SH.J. Lu2021-06-234-242/+262
| | | | | | | | Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1 version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants. This also removes the unused symbols, __GI___strlen_sse2 and __GI___wcsnlen_sse4_1.
* Properly check stack alignment [BZ #27901]H.J. Lu2021-05-241-46/+0
| | | | | | | | | | | | | | | | | | | | | | 1. Replace if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0) which may be optimized out by compiler, with int __attribute__ ((weak, noclone, noinline)) is_aligned (void *p, int align) { return (((uintptr_t) p) & (align - 1)) != 0; } 2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN. 3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment for both i386 and x86-64. 4. Update powerpc to use TEST_STACK_ALIGN_INIT. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86: Improve memmove-vec-unaligned-erms.SNoah Goldstein2021-05-231-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the condition for copy 4x VEC so that if length is exactly equal to 4 * VEC_SIZE it will use the 4x VEC case instead of 8x VEC case. Results For Skylake memcpy-avx2-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 9.137 , 6.873 , New , 75.22 128 , 7 , 0 , 12.933 , 7.732 , New , 59.79 128 , 0 , 7 , 11.852 , 6.76 , New , 57.04 128 , 7 , 7 , 12.587 , 6.808 , New , 54.09 Results For Icelake memcpy-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 9.963 , 5.416 , New , 54.36 128 , 7 , 0 , 16.467 , 8.061 , New , 48.95 128 , 0 , 7 , 14.388 , 7.644 , New , 53.13 128 , 7 , 7 , 14.546 , 7.642 , New , 52.54 Results For Tigerlake memcpy-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 0 , 8.979 , 4.95 , New , 55.13 128 , 7 , 0 , 14.245 , 7.122 , New , 50.0 128 , 0 , 7 , 12.668 , 6.675 , New , 52.69 128 , 7 , 7 , 13.042 , 6.802 , New , 52.15 Results For Skylake memmove-avx2-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 6.181 , 5.691 , New , 92.07 128 , 32 , 0 , 6.165 , 5.752 , New , 93.3 128 , 0 , 7 , 13.923 , 9.37 , New , 67.3 128 , 7 , 0 , 12.049 , 10.182 , New , 84.5 Results For Icelake memmove-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 5.479 , 4.889 , New , 89.23 128 , 32 , 0 , 5.127 , 4.911 , New , 95.79 128 , 0 , 7 , 18.885 , 13.547 , New , 71.73 128 , 7 , 0 , 15.565 , 14.436 , New , 92.75 Results For Tigerlake memmove-evex-erms size, al1 , al2 , Cur T , New T , Win , New / Cur 128 , 0 , 32 , 5.275 , 4.815 , New , 91.28 128 , 32 , 0 , 5.376 , 4.565 , New , 84.91 128 , 0 , 7 , 19.426 , 14.273 , New , 73.47 128 , 7 , 0 , 15.924 , 14.951 , New , 93.89 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Improve memset-vec-unaligned-erms.SNoah Goldstein2021-05-201-22/+28
| | | | | | | | | | | | | | | No bug. This commit makes a few small improvements to memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 instead of 128. Either alignment will perform equally well in a loop and 128 just increases the odds of having to do an extra iteration which can be significant overhead for small values. 2) Align some targets and the loop. 3) Remove an ALU from the alignment process. 4) Reorder the last 4x VEC so that they are stored after the loop. 5) Move the condition for leq 8x VEC to before the alignment process. test-memset and test-wmemset are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memcmp-evex-movbe.SNoah Goldstein2021-05-181-302/+408
| | | | | | | | | | | | No bug. This commit optimizes memcmp-evex.S. The optimizations include adding a new vec compare path for small sizes, reorganizing the entry control flow, removing some unnecissary ALU instructions from the main loop, and most importantly replacing the heavy use of vpcmp + kand logic with vpxor + vptern. test-memcmp and test-wmemcmp are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memcmp-avx2-movbe.SNoah Goldstein2021-05-183-281/+402
| | | | | | | | | | No bug. This commit optimizes memcmp-avx2.S. The optimizations include adding a new vec compare path for small sizes, reorganizing the entry control flow, and removing some unnecissary ALU instructions from the main loop. test-memcmp and test-wmemcmp are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* elf: Use relaxed atomics for racy accesses [BZ #19329]Szabolcs Nagy2021-05-111-1/+2
| | | | | | | | | | | | | | | | | | This is a follow up patch to the fix for bug 19329. This adds relaxed MO atomics to accesses that were previously data races but are now race conditions, and where relaxed MO is sufficient. The race conditions all follow the pattern that the write is behind the dlopen lock, but a read can happen concurrently (e.g. during tls access) without holding the lock. For slotinfo entries the read value only matters if it reads from a synchronized write in dlopen or dlclose, otherwise the related dtv entry is not valid to access so it is fine to leave it in an inconsistent state. The same applies for GL(dl_tls_max_dtv_idx) and GL(dl_tls_generation), but there the algorithm relies on the fact that the read of the last synchronized write is an increasing value. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Add EVEX optimized memchr family not safe for RTMNoah Goldstein2021-05-0810-41/+217
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No bug. This commit adds a new implementation for EVEX memchr that is not safe for RTM because it uses vzeroupper. The benefit is that by using ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is faster than the RTM safe version which cannot use vpcmpeq because there is no EVEX encoding for the instruction. All parts of the implementation aside from the 4x loop are the same for the two versions and the optimization is only relevant for large sizes. Tigerlake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 9.2 , 9.04 , no-RTM , 0.16 512 , 7 , 224 , 9.19 , 8.98 , no-RTM , 0.21 2048 , 0 , 256 , 10.74 , 10.54 , no-RTM , 0.2 2048 , 0 , 512 , 14.81 , 14.87 , RTM , 0.06 2048 , 0 , 1024 , 22.97 , 22.57 , no-RTM , 0.4 2048 , 0 , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <-- Icelake: size , algn , Pos , Cur T , New T , Win , Dif 512 , 6 , 192 , 7.6 , 7.3 , no-RTM , 0.3 512 , 7 , 224 , 7.63 , 7.27 , no-RTM , 0.36 2048 , 0 , 256 , 8.48 , 8.38 , no-RTM , 0.1 2048 , 0 , 512 , 11.57 , 11.42 , no-RTM , 0.15 2048 , 0 , 1024 , 17.92 , 17.38 , no-RTM , 0.54 2048 , 0 , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <-- test-memchr, test-wmemchr, and test-rawmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Fix an unknown vector operation in memchr-evex.SAlice Xu2021-05-071-1/+1
| | | | | | | An unknown vector operation occurred in commit 2a76821c308. Fixed it by using "ymm{k1}{z}" but not "ymm {k1} {z}". Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* Remove architecture specific sched_cpucount optimizationsAdhemerval Zanella2021-05-072-61/+0
| | | | | | | | | | | | | | And replace the generic algorithm with the Brian Kernighan's one. GCC optimize it with popcnt if the architecture supports, so there is no need to add the extra POPCNT define to enable it. This is really a micro-optimization that only adds complexity: recent ABIs already support it (x86-64-v2 or power64le) and it simplifies the code for internal usage, since i686 does not allow an internal iFUNC call. Checked on x86_64-linux-gnu, aarch64-linux-gnu, and powerpc64le-linux-gnu.
* x86: Optimize memchr-evex.SNoah Goldstein2021-05-031-225/+322
| | | | | | | | | | | | No bug. This commit optimizes memchr-evex.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, saving some ALU in the alignment process, and most importantly increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memchr-avx2.SNoah Goldstein2021-05-031-178/+247
| | | | | | | | | | | No bug. This commit optimizes memchr-avx2.S. The optimizations include replacing some branches with cmovcc, avoiding some branches entirely in the less_4x_vec case, making the page cross logic less strict, asaving a few instructions the in loop return loop. test-memchr, test-rawmemchr, and test-wmemchr are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* regenerate ulps on x86_64 with -march=nativePaul Zimmermann2021-04-281-3/+3
| | | | | | | On x86_64, when configuring glibc with CFLAGS="-O2 -g -march=native", some tests fail. After this patch, "make check" succeeds. Tested on Intel Core i5-4590 with gcc 10.2.1.
* x86: Optimize strchr-evex.SNoah Goldstein2021-04-251-174/+218
| | | | | | | | | | No bug. This commit optimizes strchr-evex.S. The optimizations are mostly small things such as save an ALU in the alignment process, saving a few instructions in the loop return. The one significant change is saving 2 instructions in the 4x loop. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Optimize strchr-avx2.SNoah Goldstein2021-04-251-117/+169
| | | | | | | | | | No bug. This commit optimizes strchr-avx2.S. The optimizations are all small things such as save an ALU in the alignment process, saving a few instructions in the loop return, saving some bytes in the main loop, and increasing the ILP in the return cases. test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* nptl: Move pthread_spin_trylock into libcFlorian Weimer2021-04-231-3/+10
| | | | The symbol was moved using scripts/move-symbol-to-libc.py.
* nptl: Move pthread_spin_lock into libcFlorian Weimer2021-04-231-2/+8
| | | | The symbol was moved using scripts/move-symbol-to-libc.py.
* nptl: Move pthread_spin_init, Move pthread_spin_unlock into libcFlorian Weimer2021-04-231-5/+11
| | | | | | | For some architectures, the two functions are aliased, so these symbols need to be moved at the same time. The symbols were moved using scripts/move-symbol-to-libc.py.
* x86: Remove low-level lock optimizationFlorian Weimer2021-04-211-1/+0
| | | | | | | | | | | | | | | The current approach is to do this optimizations at a higher level, in generic code, so that single-threaded cases can be specifically targeted. Furthermore, using IS_IN (libc) as a compile-time indicator that all locks are private is no longer correct once process-shared lock implementations are moved into libc. The generic <lowlevellock.h> is not compatible with assembler code (obviously), so it's necessary to remove two long-unused #includes. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* elf: Remove lazy tlsdesc relocation related codeSzabolcs Nagy2021-04-211-1/+0
| | | | | | | | | | | Remove generic tlsdesc code related to lazy tlsdesc processing since lazy tlsdesc relocation is no longer supported. This includes removing GL(dl_load_lock) from _dl_make_tlsdesc_dynamic which is only called at load time when that lock is already held. Added a documentation comment too. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Optimize strlen-avx2.SNoah Goldstein2021-04-192-214/+334
| | | | | | | | | | No bug. This commit optimizes strlen-avx2.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Optimize strlen-evex.SNoah Goldstein2021-04-191-264/+317
| | | | | | | | | | No bug. This commit optimizes strlen-evex.S. The optimizations are mostly small things but they add up to roughly 10-30% performance improvement for strlen. The results for strnlen are bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen are all passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.SNoah Goldstein2021-04-195-27/+74
| | | | | | | | No bug. This commit adds optimized cased for less_vec memset case that uses the avx512vl/avx512bw mask store avoiding the excessive branches. test-memset and test-wmemset are passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 for strchr-avx2.SH.J. Lu2021-04-192-5/+11
| | | | | | | | | | | | | | | | | Since strchr-avx2.S updated by commit 1f745ecc2109890886b161d4791e1406fdfc29b8 Author: noah <goldstein.w.n@gmail.com> Date: Wed Feb 3 00:38:59 2021 -0500 x86-64: Refactor and improve performance of strchr-avx2.S uses sarx: c4 e2 72 f7 c0 sarx %ecx,%eax,%eax for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and ifunc-avx2.h.
* x86-64: Require BMI2 for __strlen_evex and __strnlen_evexH.J. Lu2021-04-191-2/+4
| | | | | | | | | | | | | | | | | Since __strlen_evex and __strnlen_evex added by commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77 Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri Mar 5 06:24:52 2021 -0800 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX use sarx: c4 e2 6a f7 c0 sarx %edx,%eax,%eax require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c. ifunc-avx2.h already requires BMI2 for EVEX implementation.
* x86: Update large memcpy case in memmove-vec-unaligned-erms.Snoah2021-04-161-73/+265
| | | | | | | | | | | | | | No Bug. This commit updates the large memcpy case (no overlap). The update is to perform memcpy on either 2 or 4 contiguous pages at once. This 1) helps to alleviate the affects of false memory aliasing when destination and source have a close 4k alignment and 2) In most cases and for most DRAM units is a modestly more efficient access pattern. These changes are a clear performance improvement for VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy, test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Remove lazy tlsdesc relocation related codeSzabolcs Nagy2021-04-154-219/+2
| | | | | _dl_tlsdesc_resolve_rela and _dl_tlsdesc_resolve_hold are only used for lazy tlsdesc relocation processing which is no longer supported.
* x86_64: Avoid lazy relocation of tlsdesc [BZ #27137]Szabolcs Nagy2021-04-151-5/+14
| | | | | | | | | | | | | Lazy tlsdesc relocation is racy because the static tls optimization and tlsdesc management operations are done without holding the dlopen lock. This similar to the commit b7cf203b5c17dd6d9878537d41e0c7cc3d270a67 for aarch64, but it fixes a different race: bug 27137. Another issue is that ld auditing ignores DT_BIND_NOW and thus tries to relocate tlsdesc lazily, but that does not work in a BIND_NOW module due to missing DT_TLSDESC_PLT. Unconditionally relocating tlsdesc at load time fixes this bug 27721 too.
* Improve the accuracy of tgamma (BZ #26983)Paul Zimmermann2021-04-071-3/+3
| | | | | | | | | | | | With this patch, the maximal known error for tgamma is now reduced to 9 ulps for dbl-64, for all rounding modes. Since exhaustive testing is not possible for dbl-64, it might be that there are still cases with an error larger than 9 ulps, but all known cases are fixed (intensive tests were done to find cases with large errors). Tested on x86_64 and powerpc (and by Adhemerval Zanella on aarch64, arm, s390x, sparc, and i686). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* Fix the inaccuracy of j0f/j1f/y0f/y1f [BZ #14469, #14470, #14471, #14472]Paul Zimmermann2021-04-021-38/+38
| | | | | | | | | | | | | | | | | | | | | | | For j0f/j1f/y0f/y1f, the largest error for all binary32 inputs is reduced to at most 9 ulps for all rounding modes. The new code is enabled only when there is a cancellation at the very end of the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not give any visible slowdown on average. Two different algorithms are used: * around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of degree 3 are used, computed using the Sollya tool (https://www.sollya.org/) * for large inputs, an asymptotic formula from [1] is used [1] Fast and Accurate Bessel Function Computation, John Harrison, Proceedings of Arith 19, 2009. Inputs yielding the new largest errors are added to auto-libm-test-in, and ulps are regenerated for various targets (thanks Adhemerval Zanella). Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86-64: Fix ifdef indentation in strlen-evex.SSunil K Pandey2021-04-011-8/+8
| | | | | Fix some indentations of ifdef in file strlen-evex.S which are off by 1 and confusing to read.
* x86_64: Correct THREAD_SETMEM/THREAD_SETMEM_NC for movq [BZ #27591]H.J. Lu2021-04-013-2/+74
| | | | | | | | | | | | | | | | config/i386/constraints.md in GCC has (define_constraint "e" "32-bit signed integer constant, or a symbolic reference known to fit that range (for immediate operands in sign-extending x86-64 instructions)." (match_operand 0 "x86_64_immediate_operand")) Since movq takes a signed 32-bit immediate or a register source operand, use "er", instead of "nr"/"ir", constraint for 32-bit signed integer constant or register on movq. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functionsH.J. Lu2021-03-293-19/+42
| | | | | | Update ifunc-memmove.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL since VZEROUPPER isn't needed at function exit.
* x86-64: Use ZMM16-ZMM31 in AVX512 memset family functionsH.J. Lu2021-03-294-24/+31
| | | | | | | Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
* x86-64: Add AVX optimized string/memory functions for RTMH.J. Lu2021-03-2952-248/+670
| | | | | | | | | | | | | | | | | Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX optimized string/memory functions with xtest jz 1f vzeroall ret 1: vzeroupper ret at function exit on processors with usable RTM, but without 256-bit EVEX instructions to avoid VZEROUPPER inside a transactionally executing RTM region.
* x86-64: Add memcmp family functions with 256-bit EVEXH.J. Lu2021-03-295-4/+467
| | | | | | | Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function exit.
* x86-64: Add memset family functions with 256-bit EVEXH.J. Lu2021-03-296-14/+90
| | | | | | | Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.