...
* x86: Add EVEX optimized str{n}casecmp (Noah Goldstein, 2022-05-16; 6 files, -40/+321)
  geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621
  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 84e7c46df4086873eae28a1fb87d2cf5388b1e16)

* x86: Add AVX2 optimized str{n}casecmp (Noah Goldstein, 2022-05-16; 8 files, -31/+331)
  geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702
  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit bbf81222343fed5cd704001a2ae0d86c71544151)

* x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S (Noah Goldstein, 2022-05-16; 1 file, -48/+35)
  Slightly faster method of doing TOLOWER that saves an instruction.

  Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
  CET enabled, the hard-coded nop misaligned the entry to strcasecmp.

  geometric_mean(N=40) of all benchmarks New / Original: .920

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit d154758e618ec9324f5d339c46db0aa27e8b1226)

* x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S (Noah Goldstein, 2022-05-16; 1 file, -35/+29)
  Slightly faster method of doing TOLOWER that saves an instruction.

  Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
  CET enabled, the hard-coded nop misaligned the entry to strcasecmp.

  geometric_mean(N=40) of all benchmarks New / Original: .894

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 670b54bc585ea4a94f3b2e9272ba44aa6b730b73)

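  The instruction-saving idea generalizes. A minimal scalar C sketch of a
  branchless ASCII TOLOWER (illustrative only; the two commits above
  implement this on vector registers in assembly, and the helper name is
  hypothetical):

      #include <stdint.h>

      /* One unsigned subtraction folds the c >= 'A' && c <= 'Z' range
         check into a single compare, the kind of saving described
         above.  'a' - 'A' == 0x20.  */
      static inline uint8_t ascii_tolower (uint8_t c)
      {
        return c + ((uint8_t) (c - 'A') <= 'Z' - 'A' ? 0x20 : 0);
      }
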
* x86: Remove strspn-sse2.S and use the generic implementation (Noah Goldstein, 2022-05-16; 2 files, -118/+3)
  The generic implementation is faster.

  geometric_mean(N=20) of all benchmarks New / Original: .710

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 9c8a6ad620b49a27120ecdd7049c26bf05900397)

* x86: Remove strpbrk-sse2.S and use the generic implementation (Noah Goldstein, 2022-05-16; 2 files, -7/+3)
  The generic implementation is faster (see strcspn commit).

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 653358535280a599382cb6c77538a187dac6a87f)

* x86: Remove strcspn-sse2.S and use the generic implementation (Noah Goldstein, 2022-05-16; 2 files, -125/+3)
  The generic implementation is faster.

  geometric_mean(N=20) of all benchmarks New / Original: .678

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit fe28e7d9d9535ebab4081d195c553b4fbf39d9ae)

* x86: Optimize strspn in strspn-c.c (Noah Goldstein, 2022-05-16; 1 file, -47/+39)
  Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
  _mm_cmpistri. Also change the offset to unsigned to avoid unnecessary
  sign extensions.

  geometric_mean(N=20) of all benchmarks that don't fall back on sse2;
  New / Original: .901

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 412d10343168b05b8cf6c3683457cf9711d28046)

* x86: Optimize strcspn and strpbrk in strcspn-c.c (Noah Goldstein, 2022-05-16; 1 file, -46/+37)
  Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
  _mm_cmpistri. Also change the offset to unsigned to avoid unnecessary
  sign extensions.

  geometric_mean(N=20) of all benchmarks that don't fall back on
  sse2/strlen; New / Original: .928

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 30d627d477d7255345a4b713cf352ac32d644d61)

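  For reference, a minimal C sketch of the _mm_cmpeq_epi8 /
  _mm_movemask_epi8 technique the two commits above use (page-cross and
  tail handling omitted; the helper name is hypothetical):

      #include <emmintrin.h>  /* SSE2 */
      #include <stdint.h>

      /* Offset of the NUL byte within one 16-byte block, or 16 if none.
         Assumes reading 16 bytes at p is safe.  */
      static inline unsigned block_strlen16 (const char *p)
      {
        __m128i chunk = _mm_loadu_si128 ((const __m128i *) p);
        __m128i zeros = _mm_cmpeq_epi8 (chunk, _mm_setzero_si128 ());
        uint32_t mask = (uint32_t) _mm_movemask_epi8 (zeros);
        return mask ? (unsigned) __builtin_ctz (mask) : 16;
      }
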
* x86: Code cleanup in strchr-evex and comment justifying branch (Noah Goldstein, 2022-05-16; 1 file, -66/+80)
  Small code cleanup for size: -81 bytes.

  Add comment justifying using a branch to do NULL/non-null return.

  All string/memory tests pass and no regressions in benchtests.

  geometric_mean(N=20) of all benchmarks New / Original: .985
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit ec285ea90415458225623ddc0492ae3f705af043)

* x86: Code cleanup in strchr-avx2 and comment justifying branch (Noah Goldstein, 2022-05-16; 1 file, -97/+107)
  Small code cleanup for size: -53 bytes.

  Add comment justifying using a branch to do NULL/non-null return.

  All string/memory tests pass and no regressions in benchtests.

  geometric_mean(N=20) of all benchmarks Original / New: 1.00
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit a6fbf4d51e9ba8063c4f8331564892ead9c67344)

* x86_64: Remove bcopy optimizations (Adhemerval Zanella, 2022-05-16; 1 file, -7/+0)
  The symbol is not present in the current POSIX specification, and the
  compiler already generates a memmove call.
  (cherry picked from commit bf92893a14ebc161b08b28acc24fa06ae6be19cb)

* x86-64: Remove bzero weak alias in SSE2 memset (H.J. Lu, 2022-05-16; 1 file, -3/+1)
  commit 3d9f171bfb5325bd5f427e9fc386453358c6e840
  Author: H.J. Lu <hjl.tools@gmail.com>
  Date:   Mon Feb 7 05:55:15 2022 -0800

      x86-64: Optimize bzero

  added the optimized bzero. Remove the bzero weak alias in SSE2 memset
  to avoid an undefined __bzero in memset-sse2-unaligned-erms.
  (cherry picked from commit 0fb8800029d230b3711bf722b2a47db92d0e273f)

* x86_64/multiarch: Sort sysdep_routines and put one entry per line (H.J. Lu, 2022-05-16; 1 file, -30/+48)
  (cherry picked from commit c328d0152d4b14cca58407ec68143894c8863004)

* x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ)) (H.J. Lu, 2022-05-16; 1 file, -1/+2)
  (cherry picked from commit 1283948f236f209b7d3f44b69a42b96806fa6da0)

* x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896] (Noah Goldstein, 2022-05-05; 2 files, -1/+16)
  The overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm,
  not __wcscmp_avx2.

  commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Sun Jan 9 16:02:21 2022 -0600

      x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

  set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set to
  fall back to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm`, which can
  cause spurious aborts.

  This change will need to be backported.

  All string/memory tests pass.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 9fef7039a7d04947bc89296ee0d187bc8d89b772)

* x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895] (Noah Goldstein, 2022-05-05; 3 files, -5/+17)
  The logic can read before the start of `s1` / `s2` if both `s1` and
  `s2` are near the start of a page. To avoid having the result
  contaminated by these comparisons, the `strcmp` variants mask them off;
  this masking was missing in the `strncmp` variants, causing the bug.
  This commit adds the masking to `strncmp` so that out-of-range
  comparisons don't affect the result.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass, as
  well as a full xcheck on x86_64 linux.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit e108c02a5e23c8c88ce66d8705d4a24bb6b9a8bf)

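  Schematically, the masking idea looks like the following C sketch (the
  names are hypothetical and the real fix is in EVEX/AVX2 assembly;
  assumes shifted_back_by < 32):

      #include <stdint.h>

      /* When a vector load is shifted back so it stays on one page, the
         comparison bits for bytes that lie before s1/s2 must be dropped,
         or they contaminate the result as described above.  */
      static inline uint32_t discard_prestring_bits (uint32_t cmp_mask,
                                                     unsigned shifted_back_by)
      {
        return cmp_mask & (~0u << shifted_back_by);
      }
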
* x86: Set .text section in memset-vec-unaligned-erms (Noah Goldstein, 2022-05-05; 1 file, -0/+1)
  commit 3d9f171bfb5325bd5f427e9fc386453358c6e840
  Author: H.J. Lu <hjl.tools@gmail.com>
  Date:   Mon Feb 7 05:55:15 2022 -0800

      x86-64: Optimize bzero

  removed the setting of the .text section for the code. This commit
  adds that back.
  (cherry picked from commit 7912236f4a597deb092650ca79f33504ddb4af28)

* x86-64: Optimize bzero (H.J. Lu, 2022-05-05; 10 files, -105/+382)
  Zero is by far the most common value passed to memset (99%+ of calls
  for Python3 and GCC). bzero can be slightly more optimized for this
  case by using a zero-idiom xor to broadcast the set value to a
  register (vector or GPR).

  Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 3d9f171bfb5325bd5f427e9fc386453358c6e840)

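  In intrinsics terms, the zero-idiom shortcut looks roughly like this C
  sketch (requires AVX2; the function name is hypothetical):

      #include <immintrin.h>

      /* For value 0, _mm256_setzero_si256 compiles to a dependency-
         breaking 'vpxor' zero idiom; any other value needs a real
         broadcast such as 'vpbroadcastb'.  */
      static inline __m256i memset_fill_vec (int c)
      {
        return c == 0 ? _mm256_setzero_si256 ()
                      : _mm256_set1_epi8 ((char) c);
      }
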
* x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only) (Noah Goldstein, 2022-05-05; 1 file, -3/+4)
  commit b62ace2740a106222e124cc86956448fa07abf4d
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Sun Feb 6 00:54:18 2022 -0600

      x86: Improve vec generation in memset-vec-unaligned-erms.S

  Revert usage of 'pshufb' in broadcast logic, as it is an SSSE3
  instruction and memset.S is restricted to only SSE2 instructions.
  (cherry picked from commit 1b0c60f95bbe2eded80b2bb5be75c0e45b11cde1)

* x86: Improve vec generation in memset-vec-unaligned-erms.S (Noah Goldstein, 2022-05-05; 5 files, -87/+152)
  No bug. Split vec generation into multiple steps. This allows the
  broadcast in AVX2 to use 'xmm' registers for the L(less_vec) case.
  This saves an expensive lane-cross instruction and removes the need
  for 'vzeroupper'.

  For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
  byte broadcast.

  Results for memset-avx2 small (geomean of N = 20 benchset runs):

  size, New Time, Old Time, New / Old
     0,    4.100,    3.831,     0.934
     1,    5.074,    4.399,     0.867
     2,    4.433,    4.411,     0.995
     4,    4.487,    4.415,     0.984
     8,    4.454,    4.396,     0.987
    16,    4.502,    4.443,     0.987

  All relevant string/wcsmbs tests are passing.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit b62ace2740a106222e124cc86956448fa07abf4d)

* x86-64: Fix strcmp-evex.S (H.J. Lu, 2022-05-05; 1 file, -1/+1)
  Change "movl %edx, %rdx" to "movl %edx, %edx" in:

  commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Mon Jan 10 15:35:39 2022 -0600

      x86: Optimize strcmp-evex.S

  (cherry picked from commit 0e0199a9e02ebe42e2b36958964d63f03573c382)

* x86-64: Fix strcmp-avx2.S (H.J. Lu, 2022-05-05; 1 file, -1/+1)
  Change "movl %edx, %rdx" to "movl %edx, %edx" in:

  commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45
  Author: Noah Goldstein <goldstein.w.n@gmail.com>
  Date:   Mon Jan 10 15:35:38 2022 -0600

      x86: Optimize strcmp-avx2.S

  (cherry picked from commit c15efd011cea3d8f0494269eb539583215a1feed)

* x86: Optimize strcmp-evex.S (Noah Goldstein, 2022-05-05; 1 file, -793/+919)
  Optimizations are primarily to the loop logic and to how the page-cross
  logic interacts with the loop. The page-cross logic is at times more
  expensive for short strings near the end of a page but not crossing
  the page. This is done to retest the page-cross conditions with a
  non-faulting check and to improve the logic for entering the loop
  afterwards. This affects only particular cases, however, and is
  generally made up for by more than 10x improvements on the transition
  from the page-cross case to the loop case.

  The non-page-cross cases are nearly universally improved as well.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9)

* x86: Optimize strcmp-avx2.S (Noah Goldstein, 2022-05-05; 1 file, -652/+940)
  Optimizations are primarily to the loop logic and to how the page-cross
  logic interacts with the loop. The page-cross logic is at times more
  expensive for short strings near the end of a page but not crossing
  the page. This is done to retest the page-cross conditions with a
  non-faulting check and to improve the logic for entering the loop
  afterwards. This affects only particular cases, however, and is
  generally made up for by more than 10x improvements on the transition
  from the page-cross case to the loop case.

  The non-page-cross cases are improved most for smaller sizes [0, 128]
  and are about even for (128, 4096]. The loop page-cross logic is
  improved, so a more significant speedup is seen there as well.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45)

* x86: Optimize L(less_vec) case in memcmp-evex-movbe.S (Noah Goldstein, 2022-05-02; 1 file, -193/+56)
  No bug. The optimizations are twofold:

  1) Replace the page-cross and 0/1 checks with masked load instructions
     in L(less_vec). In applications this reduces branch misses in the
     hot [0, 32] case (see the sketch after this list).
  2) Change control flow so that the L(less_vec) case gets the fall
     through.

  Change 2) helps copies in the [0, 32] size range but comes at the cost
  of copies in the [33, 64] size range. From profiles of GCC and
  Python3, 94%+ and 99%+ of calls are in the [0, 32] range, so this
  appears to be the right tradeoff.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit abddd61de090ae84e380aff68a98bd94ef704667)

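  A minimal C sketch of the masked-load idea in 1), using AVX-512VL/BW
  intrinsics (compile with -mavx512vl -mavx512bw -mbmi2; masked-off
  lanes cannot fault, which is what removes the page-cross check; names
  are hypothetical and this is not the glibc code):

      #include <immintrin.h>
      #include <stddef.h>

      /* Nonzero iff the first len (<= 32) bytes of a and b differ.  */
      static inline int differs_upto32 (const void *a, const void *b,
                                        size_t len)
      {
        __mmask32 k = _bzhi_u32 (~0u, (unsigned) len); /* low len bits */
        __m256i va = _mm256_maskz_loadu_epi8 (k, a);
        __m256i vb = _mm256_maskz_loadu_epi8 (k, b);
        return _mm256_cmpneq_epi8_mask (va, vb) != 0;
      }
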
* x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI (H.J. Lu, 2022-05-02; 1 file, -2/+5)
  Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI
  since they won't lower CPU frequency when ZMM load and store
  instructions are used.
  (cherry picked from commit ceeffe968c01b1202e482f4855cb6baf5c6cb713)

* x86-64: Use notl in EVEX strcmp [BZ #28646] (Noah Goldstein, 2022-05-02; 2 files, -6/+36)
  Must use notl %edi here, as the lower bits are for CHAR comparisons
  potentially out of range and thus can be 0 without indicating a
  mismatch. This fixes BZ #28646.

  Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 4df1fa6ddc8925a75f3da644d5da3bb16eb33f02)

* x86: Shrink memcmp-sse4.S code size (Noah Goldstein, 2022-05-02; 1 file, -1621/+646)
  No bug. This implementation refactors memcmp-sse4.S primarily with
  minimizing code size in mind. It does this by removing the lookup-table
  logic and removing the unrolled check from the (256, 512] byte range.

  memcmp-sse4  code size reduction: -3487 bytes
  wmemcmp-sse4 code size reduction: -1472 bytes

  The current memcmp-sse4.S implementation has a large code size cost.
  This has serious adverse effects on the ICache / ITLB. While in
  micro-benchmarks the implementation appears fast, traces of real-world
  code have shown that the speed in micro-benchmarks does not translate
  when the ICache/ITLB are not primed, and that the cost of the code
  size has measurable negative effects on overall application
  performance.

  See https://research.google/pubs/pub48320/ for more details.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 2f9062d7171850451e6044ef78d91ff8c017b9c0)

* x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h (Noah Goldstein, 2022-05-02; 2 files, -14/+20)
  No bug. This patch doubles the rep_movsb_threshold when using ERMS.
  Based on benchmarks, the vector copy loop, especially now that it
  handles 4k aliasing, is better for these medium-range sizes.

  On Skylake with ERMS:

  size, align1, align2, dst > src, (rep movsb) / (vec copy)
  4096,      0,      0,         0, 0.975
  4096,      0,      0,         1, 0.953
  4096,     12,      0,         0, 0.969
  4096,     12,      0,         1, 0.872
  4096,     44,      0,         0, 0.979
  4096,     44,      0,         1, 0.83
  4096,      0,     12,         0, 1.006
  4096,      0,     12,         1, 0.989
  4096,      0,     44,         0, 0.739
  4096,      0,     44,         1, 0.942
  4096,     12,     12,         0, 1.009
  4096,     12,     12,         1, 0.973
  4096,     44,     44,         0, 0.791
  4096,     44,     44,         1, 0.961
  4096,   2048,      0,         0, 0.978
  4096,   2048,      0,         1, 0.951
  4096,   2060,      0,         0, 0.986
  4096,   2060,      0,         1, 0.963
  4096,   2048,     12,         0, 0.971
  4096,   2048,     12,         1, 0.941
  4096,   2060,     12,         0, 0.977
  4096,   2060,     12,         1, 0.949
  8192,      0,      0,         0, 0.85
  8192,      0,      0,         1, 0.845
  8192,     13,      0,         0, 0.937
  8192,     13,      0,         1, 0.939
  8192,     45,      0,         0, 0.932
  8192,     45,      0,         1, 0.927
  8192,      0,     13,         0, 0.621
  8192,      0,     13,         1, 0.62
  8192,      0,     45,         0, 0.53
  8192,      0,     45,         1, 0.516
  8192,     13,     13,         0, 0.664
  8192,     13,     13,         1, 0.659
  8192,     45,     45,         0, 0.593
  8192,     45,     45,         1, 0.575
  8192,   2048,      0,         0, 0.854
  8192,   2048,      0,         1, 0.834
  8192,   2061,      0,         0, 0.863
  8192,   2061,      0,         1, 0.857
  8192,   2048,     13,         0, 0.63
  8192,   2048,     13,         1, 0.629
  8192,   2061,     13,         0, 0.627
  8192,   2061,     13,         1, 0.62

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 475b63702ef38b69558fc3d31a0b66776a70f1d3)

* x86: Optimize memmove-vec-unaligned-erms.S (Noah Goldstein, 2022-05-02; 6 files, -224/+381)
  No bug. The optimizations are as follows:

  1) Always align entry to 64 bytes. This makes behavior more
     predictable and makes other frontend optimizations easier.
  2) Make the L(more_8x_vec) cases 4k-aliasing aware. This can have
     significant benefits in the case that 0 < (dst - src) < [256, 512]
     (see the sketch after this list).
  3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
     improvement, and for FSRM [-10%, 25%].

  In addition to these primary changes, there is general cleanup
  throughout to optimize the aligning routines and control-flow logic.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit a6b7502ec0c2da89a7437f43171f160d713e39c6)

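  A C sketch of the 4k-aliasing hazard behind change 2) (the 512-byte
  threshold and the helper name are illustrative assumptions, not the
  glibc values):

      #include <stdint.h>

      /* When (dst - src) mod 4096 is small and positive, loads from src
         share the low page-offset bits with just-written dst lines and
         can stall on the store buffer; such copies take the
         aliasing-aware path.  */
      static inline int aliases_4k (const char *dst, const char *src)
      {
        return (((uintptr_t) dst - (uintptr_t) src) & 0xfff) < 512;
      }
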
* x86-64: Replace movzx with movzbl (Fangrui Song, 2022-05-02; 2 files, -4/+4)
  Clang cannot assemble movzx in the AT&T dialect mode.

      ../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
       movzx (%rsi), %ecx
             ^~~~

  Change movzx to movzbl, which follows the AT&T dialect and is used
  elsewhere in the file.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 6720d36b6623c5e48c070d86acf61198b33e144e)

* x86-64: Remove Prefer_AVX2_STRCMP (H.J. Lu, 2022-05-02; 5 files, -15/+2)
  Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2
  32-byte strings, EVEX strcmp has been improved to require 1 load, 1
  VPTESTM, 1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2
  KORDs, 1 KMOVD and 1 TESTL, while AVX2 strcmp requires 1 load, 2
  VPCMPEQs, 1 VPMINU, 1 VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster
  than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.
  (cherry picked from commit 14dbbf46a007ae5df36646b51ad0c9e5f5259f30)

* x86-64: Improve EVEX strcmp with masked load (H.J. Lu, 2022-05-02; 1 file, -218/+243)
  In strcmp-evex.S, to compare 2 32-byte strings, replace

          VMOVU   (%rdi, %rdx), %YMM0
          VMOVU   (%rsi, %rdx), %YMM1
          /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
          VPCMP   $4, %YMM0, %YMM1, %k0
          VPCMP   $0, %YMMZERO, %YMM0, %k1
          VPCMP   $0, %YMMZERO, %YMM1, %k2
          /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
          kord    %k1, %k2, %k1
          /* Each bit in K1 represents a NULL or a mismatch.  */
          kord    %k0, %k1, %k1
          kmovd   %k1, %ecx
          testl   %ecx, %ecx
          jne     L(last_vector)

  with

          VMOVU   (%rdi, %rdx), %YMM0
          VPTESTM %YMM0, %YMM0, %k2
          /* Each bit cleared in K1 represents a mismatch or a null CHAR
             in YMM0 and 32 bytes at (%rsi, %rdx).  */
          VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
          kmovd   %k1, %ecx
          incl    %ecx
          jne     L(last_vector)

  It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger
  Lake and Ice Lake.

  Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55)

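  The kmovd/incl pair in the new sequence works because the mask
  polarity is inverted relative to the old code. In C terms (a sketch;
  the helper name is hypothetical):

      #include <stdint.h>

      /* Under the VPTESTM write-mask, bit i of k1 is 1 iff byte i
         matched and was a non-NUL CHAR.  All 32 bits set means "keep
         going", so adding 1 wraps to 0 exactly in the continue case;
         any mismatch or NUL leaves a cleared bit and a nonzero sum.  */
      static inline int has_mismatch_or_nul (uint32_t k1_mask)
      {
        return (uint32_t) (k1_mask + 1) != 0; /* incl %ecx; jne ...  */
      }
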
* x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S (Noah Goldstein, 2022-05-02; 1 file, -2/+2)
  This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.

  It could potentially be dangerous to use SSE2 if this function is ever
  called without using 'vzeroupper' beforehand. While compilers appear
  to use 'vzeroupper' before function calls if AVX2 has been used, using
  SSE2 here is more brittle. Since it is not absolutely necessary, it
  should be avoided.

  It costs 2 extra bytes, but the extra bytes should only eat into
  alignment padding.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit bad852b61b79503fcb3c5fc379c70f768df3e1fb)

* x86: Optimize memset-vec-unaligned-erms.S (Noah Goldstein, 2022-05-02; 5 files, -95/+232)
  No bug. The optimizations are:

  1. Change control flow for L(more_2x_vec) to fall through to the loop
     and jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
     size and saves jumps for length > 4x VEC_SIZE.
  2. For EVEX/AVX512, move L(less_vec) closer to the entry.
  3. Avoid complex address modes for length > 2x VEC_SIZE.
  4. Slightly better aligning code for the loop from the perspective of
     code size and uops.
  5. Align targets so they make full use of their fetch block and, if
     possible, cache line.
  6. Try to reduce the total number of icache lines that will need to be
     pulled in for a given length.
  7. Include a "local" version of the stosb target. For
     AVX2/EVEX/AVX512, jumping to the stosb target in the sse2 code
     section will almost certainly be to a new page. The new version
     does increase code size marginally by duplicating the target, but
     should get better iTLB behavior as a result.

  test-memset, test-wmemset, and test-bzero are all passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit e59ced238482fd71f3e493717f14f6507346741e)

* x86: Optimize memcmp-evex-movbe.S for frontend behavior and size (Noah Goldstein, 2022-05-02; 1 file, -192/+242)
  No bug. The frontend optimizations are to:

  1. Reorganize logically connected basic blocks so they are either in
     the same cache line or in adjacent cache lines.
  2. Avoid cases where basic blocks unnecessarily cross cache lines.
  3. Try to 32-byte align any basic blocks possible without sacrificing
     code size. Smaller / less hot basic blocks are used for this.

  Overall code size shrank by 168 bytes. This should make up for any
  extra costs due to aligning to 64 bytes.

  In general, performance previously varied a great deal depending on
  whether entry alignment % 64 was 0, 16, 32, or 48. These changes
  essentially make it so that the current implementation is at least
  equal to the best alignment of the original for any arguments.

  The only additional optimization is in the page-cross case. The branch
  on the equals case was removed from the size == [4, 7] case. As well,
  the [4, 7] and [2, 3] cases were swapped, as [4, 7] is likely a hotter
  argument size.

  test-memcmp and test-wmemcmp are both passing.
  (cherry picked from commit 1bd8b8d58fc9967cc073d2c13bfb6befefca2faa)

* x86: Modify ENTRY in sysdep.h so that p2align can be specified (Noah Goldstein, 2022-05-02; 1 file, -2/+5)
  No bug. This change adds a new macro, ENTRY_P2ALIGN, which takes a
  second argument: log2 of the desired function alignment. The old
  ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4), so this doesn't
  affect any existing functionality.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit fc5bd179ef3a953dff8d1655bd530d0e230fe71)

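  A simplified sketch of the macro relationship described above (the
  real sysdep.h ENTRY also emits symbol type/size bookkeeping and CFI
  directives; illustrative only):

      #define ENTRY_P2ALIGN(name, alignment) \
        .globl name;                         \
        .type name, @function;               \
        .p2align alignment;                  \
        name:

      #define ENTRY(name) ENTRY_P2ALIGN (name, 4)
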
* x86-64: Optimize load of all bits set into ZMM register [BZ #28252] (H.J. Lu, 2022-05-02; 10 files, -64/+11)
  Optimize loads of all bits set into a ZMM register in the AVX512 SVML
  codes by replacing

          vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX

  and

          vmovups   .L_2il0floatpacket.13(%rip), %zmmX

  with

          vpternlogd $0xff, %zmmX, %zmmX, %zmmX

  This fixes BZ #28252.
  (cherry picked from commit 78c9ec9000f873abe7a15a91b87080a2e4308260)

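  The equivalent intrinsics form, as a sketch (requires AVX512F; imm8
  0xff means "output 1 for every input combination", so the result is
  all-ones regardless of the register's prior contents and needs no
  memory load):

      #include <immintrin.h>

      static inline __m512i all_ones_zmm (void)
      {
        __m512i x = _mm512_undefined_epi32 ();
        return _mm512_ternarylogic_epi32 (x, x, x, 0xff);
      }
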
* x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755] (Noah Goldstein, 2022-05-02; 1 file, -0/+10)
  Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
  __wcscmp_evex. For x86_64 this covers the entire address range, so any
  larger length could not possibly be used to bound `s1` or `s2`.

  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 7e08db3359c86c94918feb33a1182cd0ff3bb10b)

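  The logic of the redirect, as a portable C sketch (the function name
  is hypothetical; the actual fix is in the EVEX assembly dispatch):

      #include <wchar.h>

      /* On x86_64 no string can be anywhere near 2^56 wchar_t long, so
         such an n cannot actually bound s1/s2 and the call degenerates
         to a plain wcscmp.  */
      int bounded_wcsncmp (const wchar_t *s1, const wchar_t *s2, size_t n)
      {
        if (n >= (1UL << 56))
          return wcscmp (s1, s2);
        return wcsncmp (s1, s2, n);
      }
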
* x86-64: Use testl to check __x86_string_control (H.J. Lu, 2022-05-02; 1 file, -2/+2)
  Use testl, instead of andl, to check __x86_string_control to avoid
  updating __x86_string_control.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit 3c8b9879cab6d41787bc5b14c1748f62fd6d0e5f)

* x86-64: Add Avoid_Short_Distance_REP_MOVSB (H.J. Lu, 2022-05-02; 5 files, -0/+25)
  commit 3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5
  Author: H.J. Lu <hjl.tools@gmail.com>
  Date:   Sat Jan 25 14:19:40 2020 -0800

      x86-64: Avoid rep movsb with short distance [BZ #27130]

  introduced some regressions on Intel processors without Fast Short REP
  MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
  short distance only on Intel processors with FSRM.

  bench-memmove-large on Skylake server shows that cycles of
  __memmove_evex_unaligned_erms improve for the following data sizes:

                                     before    after    Improvement
  length=4127, align1=3, align2=0:   479.38    349.25       27%
  length=4223, align1=9, align2=5:   405.62    333.25       18%
  length=8223, align1=3, align2=0:   786.12    496.38       37%
  length=8319, align1=9, align2=5:   727.50    501.38       31%
  length=16415, align1=3, align2=0:  1436.88   840.00       41%
  length=16511, align1=9, align2=5:  1375.50   836.38       39%
  length=32799, align1=3, align2=0:  2890.00   1860.12      36%
  length=32895, align1=9, align2=5:  2891.38   1931.88      33%

  (cherry picked from commit 91cc803d27bda34919717b496b53cf279e44a922)

* x86: Remove unnecessary overflow check from wcsnlen-sse4_1.S (Noah Goldstein, 2022-05-02; 1 file, -7/+0)
  No bug. The way wcsnlen checks whether it is near the end of maxlen is
  the following macro:

          mov     %r11, %rsi;     \
          subq    %rax, %rsi;     \
          andq    $-64, %rax;     \
          testq   $-64, %rsi;     \
          je      L(strnlen_ret)

  which works independently of s + maxlen overflowing. So the second
  overflow check is unnecessary for correctness and just extra overhead
  in the common no-overflow case.

  test-strlen.c, test-wcslen.c, test-strnlen.c and test-wcsnlen.c are
  all passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 08cbcd4dbc686bb38ec3093aff2f919fbff5ec17)

* x86: Improve memmove-vec-unaligned-erms.S (Noah Goldstein, 2022-05-02; 1 file, -3/+3)
  This patch changes the condition for the copy-4x-VEC case so that if
  length is exactly equal to 4 * VEC_SIZE, it will use the 4x VEC case
  instead of the 8x VEC case.

  Results for Skylake memcpy-avx2-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 0  , 9.137  , 6.873  , New , 75.22
  128 , 7  , 0  , 12.933 , 7.732  , New , 59.79
  128 , 0  , 7  , 11.852 , 6.76   , New , 57.04
  128 , 7  , 7  , 12.587 , 6.808  , New , 54.09

  Results for Icelake memcpy-evex-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 0  , 9.963  , 5.416  , New , 54.36
  128 , 7  , 0  , 16.467 , 8.061  , New , 48.95
  128 , 0  , 7  , 14.388 , 7.644  , New , 53.13
  128 , 7  , 7  , 14.546 , 7.642  , New , 52.54

  Results for Tigerlake memcpy-evex-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 0  , 8.979  , 4.95   , New , 55.13
  128 , 7  , 0  , 14.245 , 7.122  , New , 50.0
  128 , 0  , 7  , 12.668 , 6.675  , New , 52.69
  128 , 7  , 7  , 13.042 , 6.802  , New , 52.15

  Results for Skylake memmove-avx2-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 32 , 6.181  , 5.691  , New , 92.07
  128 , 32 , 0  , 6.165  , 5.752  , New , 93.3
  128 , 0  , 7  , 13.923 , 9.37   , New , 67.3
  128 , 7  , 0  , 12.049 , 10.182 , New , 84.5

  Results for Icelake memmove-evex-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 32 , 5.479  , 4.889  , New , 89.23
  128 , 32 , 0  , 5.127  , 4.911  , New , 95.79
  128 , 0  , 7  , 18.885 , 13.547 , New , 71.73
  128 , 7  , 0  , 15.565 , 14.436 , New , 92.75

  Results for Tigerlake memmove-evex-erms:
  size, al1, al2, Cur T  , New T  , Win , New / Cur
  128 , 0  , 32 , 5.275  , 4.815  , New , 91.28
  128 , 32 , 0  , 5.376  , 4.565  , New , 84.91
  128 , 0  , 7  , 19.426 , 14.273 , New , 73.47
  128 , 7  , 0  , 15.924 , 14.951 , New , 93.89

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 1b992204f68af851e905c16016756fd4421e1934)

* x86: Improve memset-vec-unaligned-erms.S (Noah Goldstein, 2022-05-02; 1 file, -22/+28)
  No bug. This commit makes a few small improvements to
  memset-vec-unaligned-erms.S. The changes are:

  1) Only align to 64 instead of 128. Either alignment will perform
     equally well in a loop, and 128 just increases the odds of having
     to do an extra iteration, which can be significant overhead for
     small values.
  2) Align some targets and the loop.
  3) Remove an ALU from the alignment process.
  4) Reorder the last 4x VEC so that they are stored after the loop.
  5) Move the condition for length <= 8x VEC to before the alignment
     process.

  test-memset and test-wmemset are both passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 6abf27980a947f9b6e514d6b33b83059d39566ae)

* x86: Optimize memcmp-evex-movbe.S (Noah Goldstein, 2022-05-02; 1 file, -302/+408)
  No bug. This commit optimizes memcmp-evex.S. The optimizations include
  adding a new vector compare path for small sizes, reorganizing the
  entry control flow, removing some unnecessary ALU instructions from
  the main loop, and, most importantly, replacing the heavy use of
  vpcmp + kand logic with vpxor + vptern.

  test-memcmp and test-wmemcmp are both passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 4ad473e97acdc5f6d811755b67c09f2128a644ce)

* x86: Optimize memcmp-avx2-movbe.S (Noah Goldstein, 2022-05-02; 3 files, -281/+402)
  No bug. This commit optimizes memcmp-avx2.S. The optimizations include
  adding a new vector compare path for small sizes, reorganizing the
  entry control flow, and removing some unnecessary ALU instructions
  from the main loop.

  test-memcmp and test-wmemcmp are both passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 16d12015c57701b08d7bbed6ec536641bcafb428)

* x86: Add EVEX optimized memchr family not safe for RTM (Noah Goldstein, 2022-05-02; 10 files, -41/+217)
  No bug. This commit adds a new implementation for EVEX memchr that is
  not safe for RTM because it uses vzeroupper. The benefit is that by
  using ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop,
  which is faster than the RTM-safe version, which cannot use vpcmpeq
  because there is no EVEX encoding for the instruction (see the sketch
  below). All parts of the implementation aside from the 4x loop are the
  same for the two versions, and the optimization is only relevant for
  large sizes.

  Tigerlake:
  size , algn , Pos  , Cur T , New T , Win    , Dif
  512  , 6    , 192  , 9.2   , 9.04  , no-RTM , 0.16
  512  , 7    , 224  , 9.19  , 8.98  , no-RTM , 0.21
  2048 , 0    , 256  , 10.74 , 10.54 , no-RTM , 0.2
  2048 , 0    , 512  , 14.81 , 14.87 , RTM    , 0.06
  2048 , 0    , 1024 , 22.97 , 22.57 , no-RTM , 0.4
  2048 , 0    , 2048 , 37.49 , 34.51 , no-RTM , 2.98 <--

  Icelake:
  size , algn , Pos  , Cur T , New T , Win    , Dif
  512  , 6    , 192  , 7.6   , 7.3   , no-RTM , 0.3
  512  , 7    , 224  , 7.63  , 7.27  , no-RTM , 0.36
  2048 , 0    , 256  , 8.48  , 8.38  , no-RTM , 0.1
  2048 , 0    , 512  , 11.57 , 11.42 , no-RTM , 0.15
  2048 , 0    , 1024 , 17.92 , 17.38 , no-RTM , 0.54
  2048 , 0    , 2048 , 30.37 , 27.34 , no-RTM , 3.03 <--

  test-memchr, test-wmemchr, and test-rawmemchr are all passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 104c7b1967c3e78435c6f7eab5e225a7eddf9c6e)

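  A C intrinsics sketch of the 4x-loop trick described above (compile
  with -mavx2 -mavx512vl; imm8 0xfe computes a|b|c; illustrative only,
  not the glibc loop):

      #include <immintrin.h>

      /* With ymm0-ymm15 usable, four vpcmpeq results can be OR-combined
         with one vpternlogd before a single test, which the RTM-safe
         mask-register version cannot do.  */
      static inline int any_match_4x (__m256i a, __m256i b, __m256i c,
                                      __m256i d, __m256i needle)
      {
        __m256i m0 = _mm256_cmpeq_epi8 (a, needle);
        __m256i m1 = _mm256_cmpeq_epi8 (b, needle);
        __m256i m2 = _mm256_cmpeq_epi8 (c, needle);
        __m256i m3 = _mm256_cmpeq_epi8 (d, needle);
        __m256i m = _mm256_or_si256 (
            _mm256_ternarylogic_epi32 (m0, m1, m2, 0xfe), m3);
        return ! _mm256_testz_si256 (m, m);
      }
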
* x86: Set rep_movsb_threshold to 2112 on processors with FSRM (H.J. Lu, 2022-05-02; 1 file, -0/+4)
  The glibc memcpy benchmark on Intel Core i7-1065G7 (Ice Lake) showed
  that REP MOVSB became faster after 2112 bytes:

                                    Vector Move   REP MOVSB
  length=2112, align1=0, align2=0:    24.20         24.40
  length=2112, align1=1, align2=0:    26.07         23.13
  length=2112, align1=0, align2=1:    27.18         28.13
  length=2112, align1=1, align2=1:    26.23         25.16
  length=2176, align1=0, align2=0:    23.18         22.52
  length=2176, align1=2, align2=0:    25.45         22.52
  length=2176, align1=0, align2=2:    27.14         27.82
  length=2176, align1=2, align2=2:    22.73         25.56
  length=2240, align1=0, align2=0:    24.62         24.25
  length=2240, align1=3, align2=0:    29.77         27.15
  length=2240, align1=0, align2=3:    35.55         29.93
  length=2240, align1=3, align2=3:    34.49         25.15
  length=2304, align1=0, align2=0:    34.75         26.64
  length=2304, align1=4, align2=0:    32.09         22.63
  length=2304, align1=0, align2=4:    28.43         31.24

  Use REP MOVSB for data size > 2112 bytes in memcpy on processors with
  fast short REP MOVSB (FSRM).

  * sysdeps/x86/dl-cacheinfo.h (dl_init_cacheinfo): Set
    rep_movsb_threshold to 2112 on processors with fast short REP MOVSB
    (FSRM).
  (cherry picked from commit cf2c57526ba4b57e6863ad4db8a868e2678adce8)

* x86: Optimize strchr-evex.S (Noah Goldstein, 2022-05-02; 1 file, -174/+218)
  No bug. This commit optimizes strchr-evex.S. The optimizations are
  mostly small things, such as saving an ALU in the alignment process
  and saving a few instructions in the loop return. The one significant
  change is saving 2 instructions in the 4x loop.

  test-strchr, test-strchrnul, test-wcschr, and test-wcschrnul are all
  passing.

  Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 7f3e7c262cab4e2401e4331a6ef29c428de02044)