about summary refs log tree commit diff
path: root/sysdeps/x86_64
Commit message (Collapse)AuthorAgeFilesLines
* math: Add more inputs to atan2 accuracy tests [BZ #28765]Sunil K Pandey2022-01-141-4/+4
| | | | | | | | | | | | | | | This patch adds following inputs: 0x1.bcab29da0e947p-54 0x1.bc41f4d2294b8p-54 0x1.a11891ec004d4p-348 0x1.814830510be26p-348 0x1.b836ed678be29p-588 0x1.b7be6f5a03a8cp-588 0x1.a83f842ef3f73p-633 0x1.a799d8a6677ep-633 to atan2 tests and updates x86_64 double atan2 ulps. This fixes BZ #28765. Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
* x86_64: Fix SSE4.2 libmvec atan2 function accuracy [BZ #28765]Sunil K Pandey2022-01-121-148/+173
| | | | | | | | | | | | This patch fixes SSE4.2 libmvec atan2 function accuracy for following inputs to less than 4 ulps. {0x1.bcab29da0e947p-54,0x1.bc41f4d2294b8p-54} 4.19888 ulps {0x1.b836ed678be29p-588,0x1.b7be6f5a03a8cp-588} 4.09889 ulps This fixes BZ #28765. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]Noah Goldstein2022-01-101-0/+10
| | | | | | | | | | Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to __wcscmp_evex. For x86_64 this covers the entire address range so any length larger could not possibly be used to bound `s1` or `s2`. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]Noah Goldstein2022-01-101-0/+10
| | | | | | | | | | Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to __wcscmp_avx2. For x86_64 this covers the entire address range so any length larger could not possibly be used to bound `s1` or `s2`. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
* Update copyright dates with scripts/update-copyrightsPaul Eggert2022-01-011086-1088/+1088
| | | | | | | | | | | | | | | | | | | | | | | I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not. remote: *** 912-#endif remote: *** 913: remote: *** 914- remote: *** error: lines with trailing whitespace found ... remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
* x86-64: Add vector tan/tanf implementation to libmvecSunil K Pandey2021-12-3045-0/+21885
| | | | | | | | Implement vectorized tan/tanf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector tan/tanf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector erfc/erfcf implementation to libmvecSunil K Pandey2021-12-3045-0/+14942
| | | | | | | | Implement vectorized erfc/erfcf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector erfc/erfcf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector asinh/asinhf implementation to libmvecSunil K Pandey2021-12-2945-0/+5756
| | | | | | | | Implement vectorized asinh/asinhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector asinh/asinhf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector tanh/tanhf implementation to libmvecSunil K Pandey2021-12-2945-0/+5619
| | | | | | | | Implement vectorized tanh/tanhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector tanh/tanhf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector erf/erff implementation to libmvecSunil K Pandey2021-12-2945-0/+5016
| | | | | | | | Implement vectorized erf/erff containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector erf/erff with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector acosh/acoshf implementation to libmvecSunil K Pandey2021-12-2945-0/+5237
| | | | | | | | Implement vectorized acosh/acoshf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector acosh/acoshf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector atanh/atanhf implementation to libmvecSunil K Pandey2021-12-2945-0/+5032
| | | | | | | | Implement vectorized atanh/atanhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atanh/atanhf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector log1p/log1pf implementation to libmvecSunil K Pandey2021-12-2945-0/+4419
| | | | | | | | Implement vectorized log1p/log1pf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector log1p/log1pf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector log2/log2f implementation to libmvecSunil K Pandey2021-12-2945-0/+4180
| | | | | | | | Implement vectorized log2/log2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector log2/log2f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector log10/log10f implementation to libmvecSunil K Pandey2021-12-2945-0/+3730
| | | | | | | | Implement vectorized log10/log10f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector log10/log10f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector atan2/atan2f implementation to libmvecSunil K Pandey2021-12-2945-0/+3089
| | | | | | | | Implement vectorized atan2/atan2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atan2/atan2f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector cbrt/cbrtf implementation to libmvecSunil K Pandey2021-12-2945-0/+3003
| | | | | | | | Implement vectorized cbrt/cbrtf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector cbrt/cbrtf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector sinh/sinhf implementation to libmvecSunil K Pandey2021-12-2945-0/+2866
| | | | | | | | Implement vectorized sinh/sinhf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector sinh/sinhf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector expm1/expm1f implementation to libmvecSunil K Pandey2021-12-2945-0/+2697
| | | | | | | | Implement vectorized expm1/expm1f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector expm1/expm1f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector cosh/coshf implementation to libmvecSunil K Pandey2021-12-2945-0/+2609
| | | | | | | | Implement vectorized cosh/coshf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector cosh/coshf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector exp10/exp10f implementation to libmvecSunil K Pandey2021-12-2945-0/+2589
| | | | | | | | Implement vectorized exp10/exp10f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector exp10/exp10f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector exp2/exp2f implementation to libmvecSunil K Pandey2021-12-2945-0/+2265
| | | | | | | | Implement vectorized exp2/exp2f containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector exp2/exp2f with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector hypot/hypotf implementation to libmvecSunil K Pandey2021-12-2945-0/+2123
| | | | | | | | Implement vectorized hypot/hypotf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector hypot/hypotf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector asin/asinf implementation to libmvecSunil K Pandey2021-12-2945-0/+2161
| | | | | | | | Implement vectorized asin/asinf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector asin/asinf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector atan/atanf implementation to libmvecSunil K Pandey2021-12-2945-0/+1713
| | | | | | | | Implement vectorized atan/atanf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector atan/atanf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* malloc: Remove memusage.hAdhemerval Zanella2021-12-281-20/+0
| | | | | | And use machine-sp.h instead. The Linux implementation is based on already provided CURRENT_STACK_FRAME (used on nptl code) and STACK_GROWS_UPWARD is replaced with _STACK_GROWS_UP.
* malloc: Use hp-timing on libmemusageAdhemerval Zanella2021-12-281-1/+0
| | | | Instead of reimplemeting on GETTIME macro.
* elf: Add _dl_audit_pltexitAdhemerval Zanella2021-12-282-4/+4
| | | | | | | | | It consolidates the code required to call la_pltexit audit callback. Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu. Reviewed-by: Florian Weimer <fweimer@redhat.com>
* hurd: Fix static-PIE startupSamuel Thibault2021-12-281-0/+28
| | | | | | | | hurd initialization stages use RUN_HOOK to run various initialization functions. That is however using absolute addresses which need to be relocated, which is done later by csu. We can however easily make the linker compute relative addresses which thus don't need a relocation. The new SET_RELHOOK and RUN_RELHOOK macros implement this.
* x86: Optimize L(less_vec) case in memcmpeq-evex.SNoah Goldstein2021-12-271-127/+43
| | | | | | | | | | | | | | | | | | No bug. Optimizations are twofold. 1) Replace page cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch-misses in the hot [0, 32] case. 2) Change controlflow so that L(less_vec) case gets the fall through. Change 2) helps copies in the [0, 32] size range but comes at the cost of copies in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this appears to the the right tradeoff. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize L(less_vec) case in memcmp-evex-movbe.SNoah Goldstein2021-12-271-193/+56
| | | | | | | | | | | | | | | | | | No bug. Optimizations are twofold. 1) Replace page cross and 0/1 checks with masked load instructions in L(less_vec). In applications this reduces branch-misses in the hot [0, 32] case. 2) Change controlflow so that L(less_vec) case gets the fall through. Change 2) helps copies in the [0, 32] size range but comes at the cost of copies in the [33, 64] size range. From profiles of GCC and Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this appears to the the right tradeoff. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector acos/acosf implementation to libmvecSunil K Pandey2021-12-2246-0/+2285
| | | | | | | | Implement vectorized acos/acosf containing SSE, AVX, AVX2 and AVX512 versions for libmvec as per vector ABI. It also contains accuracy and ABI tests for vector acos/acosf with regenerated ulps. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* Regenerate ulps on x86_64 with GCC 12H.J. Lu2021-12-201-1/+1
| | | | | | | | | Fix FAIL: math/test-float-clog10 FAIL: math/test-float32-clog10 on Intel Core i7-1165G7 with GCC 12.
* Cleanup encoding in commentsSiddhesh Poyarekar2021-12-131-10/+10
| | | | | | | | | Replace non-UTF-8 and non-ASCII characters in comments with their UTF-8 equivalents so that files don't end up with mixed encodings. With this, all files (except tests that actually test different encodings) have a single encoding. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGNFlorian Weimer2021-12-091-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TLS_INIT_TCB_ALIGN is not actually used. TLS_TCB_ALIGN was likely introduced to support a configuration where the thread pointer has not the same alignment as THREAD_SELF. Only ia64 seems to use that, but for the stack/pointer guard, not for storing tcbhead_t. Some ports use TLS_TCB_OFFSET and TLS_PRE_TCB_SIZE to shift the thread pointer, potentially landing in a different residue class modulo the alignment, but the changes should not impact that. In general, given that TLS variables have their own alignment requirements, having different alignment for the (unshifted) thread pointer and struct pthread would potentially result in dynamic offsets, leading to more complexity. hppa had different values before: __alignof__ (tcbhead_t), which seems to be 4, and __alignof__ (struct pthread), which was 8 (old default) and is now 32. However, it defines THREAD_SELF as: /* Return the thread descriptor for the current thread. */ # define THREAD_SELF \ ({ struct pthread *__self; \ __self = __get_cr27(); \ __self - 1; \ }) So the thread pointer points after struct pthread (hence __self - 1), and they have to have the same alignment on hppa as well. Similarly, on ia64, the definitions were different. We have: # define TLS_PRE_TCB_SIZE \ (sizeof (struct pthread) \ + (PTHREAD_STRUCT_END_PADDING < 2 * sizeof (uintptr_t) \ ? ((2 * sizeof (uintptr_t) + __alignof__ (struct pthread) - 1) \ & ~(__alignof__ (struct pthread) - 1)) \ : 0)) # define THREAD_SELF \ ((struct pthread *) ((char *) __thread_self - TLS_PRE_TCB_SIZE)) And TLS_PRE_TCB_SIZE is a multiple of the struct pthread alignment (confirmed by the new _Static_assert in sysdeps/ia64/libc-tls.c). On m68k, we have a larger gap between tcbhead_t and struct pthread. But as far as I can tell, the port is fine with that. The definition of TCB_OFFSET is sufficient to handle the shifted TCB scenario. This fixes commit 23c77f60181eb549f11ec2f913b4270af29eee38 ("nptl: Increase default TCB alignment to 32"). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* nptl: Introduce THREAD_GETMEM_VOLATILEFlorian Weimer2021-12-091-0/+2
| | | | | | This will be needed for rseq TCB access. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* nptl: Introduce <tcb-access.h> for THREAD_* accessorsFlorian Weimer2021-12-092-113/+131
| | | | | | | These are common between most architectures. Only the x86 targets are outliers. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* x86-64: Use notl in EVEX strcmp [BZ #28646]Noah Goldstein2021-12-031-6/+8
| | | | | | | | Must use notl %edi here as lower bits are for CHAR comparisons potentially out of range thus can be 0 without indicating mismatch. This fixes BZ #28646. Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector sin/sinf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector sin/sinf and input files to libmvec microbenchmark. libmvec-sin-inputs: 90% Normal random distribution range: (-DBL_MAX, DBL_MAX) mean: 0.0 sigma: 5.0 10% uniform random distribution in range (-1000.0, 1000.0) libmvec-sinf-inputs: 90% Normal random distribution range: (-FLT_MAX, FLT_MAX) mean: 0.0f sigma: 5.0f 10% uniform random distribution in range (-1000.0f, 1000.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector pow/powf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add vector pow/powf and input files to libmvec microbenchmark. libmvec-pow-inputs: arg1: 90% Normal random distribution range: (0.0, 256.0) mean: 0.0 sigma: 32.0 10% uniform random distribution in range (0.0, 256.0) arg2: 90% Normal random distribution range: (-127.0, 127.0) mean: 0.0 sigma: 16.0 10% uniform random distribution in range (-127.0, 127.0) libmvec-powf-inputs: arg1: 90% Normal random distribution range: (0.0f, 100.0f) mean: 0.0f sigma: 16.0f 10% uniform random distribution in range (0.0f, 100.0f) arg2: 90% Normal random distribution range: (-10.0f, 10.0f) mean: 0.0f sigma: 8.0f 10% uniform random distribution in range (-10.0f, 10.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector log/logf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector log/logf and input files to libmvec microbenchmark. libmvec-log-inputs: 70% Normal random distribution range: (0.0, DBL_MAX) mean: 1.0 sigma: 50.0 30% uniform random distribution in range (0.0, DBL_MAX) libmvec-logf-inputs: 70% Normal random distribution range: (0.0f, FLT_MAX) mean: 1.0f sigma: 50.0f 30% uniform random distribution in range (0.0f, FLT_MAX) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector exp/expf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector exp/expf and input files to libmvec microbenchmark. libmvec-exp-inputs: 90% Normal random distribution range: (-708.0, 709.0) mean: 0.0 sigma: 16.0 10% uniform random distribution in range (-500.0, 500.0) libmvec-expf-inputs: 90% Normal random distribution range: (-87.0f, 88.0f) mean: 0.0f sigma: 8.0f 10% uniform random distribution in range (-50.0f, 50.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector cos/cosf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector cos/cosf and input files to libmvec microbenchmark. libmvec-cos-inputs: 90% Normal random distribution range: (-DBL_MAX, DBL_MAX) mean: 0.0 sigma: 5.0 10% uniform random distribution in range (-1000.0, 1000.0) libmvec-cosf-inputs: 90% Normal random distribution range: (-FLT_MAX, FLT_MAX) mean: 0.0f sigma: 5.0f 10% uniform random distribution in range (-1000.0f, 1000.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Create microbenchmark infrastructure for libmvecSunil K Pandey2021-11-164-0/+642
| | | | | | | | | | | | | Add python script to generate libmvec microbenchmark from the input values for each libmvec function using skeleton benchmark template. Creates double and float benchmarks with vector length 1, 2, 4, 8, and 16 for each libmvec function. Vector length 1 corresponds to scalar version of function and is included for vector function perf comparison. Co-authored-by: Haochen Jiang <haochen.jiang@intel.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Shrink memcmp-sse4.S code sizeNoah Goldstein2021-11-101-1621/+646
| | | | | | | | | | | | | | | | | | | | | | | | No bug. This implementation refactors memcmp-sse4.S primarily with minimizing code size in mind. It does this by removing the lookup table logic and removing the unrolled check from (256, 512] bytes. memcmp-sse4 code size reduction : -3487 bytes wmemcmp-sse4 code size reduction: -1472 bytes The current memcmp-sse4.S implementation has a large code size cost. This has serious adverse affects on the ICache / ITLB. While in micro-benchmarks the implementations appears fast, traces of real-world code have shown that the speed in micro benchmarks does not translate when the ICache/ITLB are not primed, and that the cost of the code size has measurable negative affects on overall application performance. See https://research.google/pubs/pub48320/ for more details. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memmove-vec-unaligned-erms.SNoah Goldstein2021-11-066-224/+381
| | | | | | | | | | | | | | | | | | | | | | No bug. The optimizations are as follows: 1) Always align entry to 64 bytes. This makes behavior more predictable and makes other frontend optimizations easier. 2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have significant benefits in the case that: 0 < (dst - src) < [256, 512] 3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%] improvement and for FSRM [-10%, 25%]. In addition to these primary changes there is general cleanup throughout to optimize the aligning routines and control flow logic. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Replace movzx with movzblFangrui Song2021-11-022-4/+4
| | | | | | | | | | | | | Clang cannot assemble movzx in the AT&T dialect mode. ../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction movzx (%rsi), %ecx ^~~~ Change movzx to movzbl, which follows the AT&T dialect and is used elsewhere in the file. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Remove Prefer_AVX2_STRCMPH.J. Lu2021-11-012-4/+2
| | | | | | | | | Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1 VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.
* x86-64: Improve EVEX strcmp with masked loadH.J. Lu2021-11-011-218/+243
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In strcmp-evex.S, to compare 2 32-byte strings, replace VMOVU (%rdi, %rdx), %YMM0 VMOVU (%rsi, %rdx), %YMM1 /* Each bit in K0 represents a mismatch in YMM0 and YMM1. */ VPCMP $4, %YMM0, %YMM1, %k0 VPCMP $0, %YMMZERO, %YMM0, %k1 VPCMP $0, %YMMZERO, %YMM1, %k2 /* Each bit in K1 represents a NULL in YMM0 or YMM1. */ kord %k1, %k2, %k1 /* Each bit in K1 represents a NULL or a mismatch. */ kord %k0, %k1, %k1 kmovd %k1, %ecx testl %ecx, %ecx jne L(last_vector) with VMOVU (%rdi, %rdx), %YMM0 VPTESTM %YMM0, %YMM0, %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi, %rdx). */ VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2} kmovd %k1, %ecx incl %ecx jne L(last_vector) It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake. Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
* x86_64: Add memcmpeq.S to fix disable-multi-arch buildNoah Goldstein2021-10-281-0/+21
| | | | | | | | | | | | | | | | The following commit: commit cf4fd28ea453d1a9cec93939bc88b58ccef5437a Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Tue Oct 26 19:43:18 2021 -0500 Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S was not defined outside of multiarch and the alias for __memcmpeq in x86_64/memcmp.S was removed. This commit fixes that issue by adding x86_64/memcmpeq.S. make xcheck passes on x86_64 with and without --disable-multi-arch