path: root/sysdeps/x86_64
Commit message | Author | Age | Files | Lines
* x86_64: Optimize ffsll function code size.Sunil K Pandey2024-01-311-5/+5
  The ffsll function randomly regresses by ~20%, depending on how its code is aligned in memory. The function is 17 bytes long. Since the default function alignment is 16 bytes, it can start at offset 0, 16, 32 or 48 within a 64-byte cache line. At offsets 0, 16 and 32 the entire code fits in a single cache line. At offset 48 it is split across two cache lines, hence the random regression. Reducing the function size from 17 bytes to 12 bytes ensures that it always fits in a single 64-byte cache line. This patch fixes the random performance regression of ffsll.
  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
  (cherry picked from commit 9d94997b5f9445afd4f2bccc5fa60ff7c4361ec1)
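  The alignment arithmetic in this message can be checked with a small sketch. The helper name and the CACHE_LINE_SIZE constant are illustrative and not part of the patch; the 64-byte line size is the one the message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cache line size assumed by the commit message.  */
#define CACHE_LINE_SIZE 64u

/* Return true if a function of SIZE bytes whose entry point sits at byte
   OFFSET (taken modulo the line size) spills into the next cache line.
   Hypothetical helper, for illustration only.  */
static inline bool
crosses_cache_line (size_t offset, size_t size)
{
  return (offset % CACHE_LINE_SIZE) + size > CACHE_LINE_SIZE;
}
```

  With 16-byte function alignment the possible entry offsets are 0, 16, 32 and 48. Only offset 48 makes the 17-byte version cross a line (48 + 17 = 65), while the 12-byte version fits at every offset (48 + 12 = 60).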
* x86-64: Fix the tcb field load for x32 [BZ #31185]H.J. Lu2023-12-231-2/+2
  _dl_tlsdesc_undefweak and _dl_tlsdesc_dynamic access the thread pointer via the tcb field in TCB:

      _dl_tlsdesc_undefweak:
              _CET_ENDBR
              movq    8(%rax), %rax
              subq    %fs:0, %rax
              ret

      _dl_tlsdesc_dynamic:
              ...
              subq    %fs:0, %rax
              movq    -8(%rsp), %rdi
              ret

  Since the tcb field in TCB is a pointer, %fs:0 is a 32-bit location on x32, not a 64-bit one. It should use "sub %fs:0, %RAX_LP" instead. Since _dl_tlsdesc_undefweak returns ptrdiff_t and _dl_make_tlsdesc_dynamic returns void *, RAX_LP is appropriate here for both x32 and x86-64. This fixes BZ #31185.
  (cherry picked from commit 81be2a61dafc168327c1639e97b6dae128c7ccf3)
* x86-64: Fix the dtv field load for x32 [BZ #31184]H.J. Lu2023-12-231-1/+1
  On x32, I got

      FAIL: elf/tst-tlsgap

      $ gdb elf/tst-tlsgap
      ...
      open tst-tlsgap-mod1.so

      Thread 2 "tst-tlsgap" received signal SIGSEGV, Segmentation fault.
      [Switching to LWP 2268754]
      _dl_tlsdesc_dynamic () at ../sysdeps/x86_64/dl-tlsdesc.S:108
      108     movq (%rsi), %rax
      (gdb) p/x $rsi
      $4 = 0xf7dbf9005655fb18
      (gdb)

  This is caused by

      _dl_tlsdesc_dynamic:
              _CET_ENDBR
              /* Preserve call-clobbered registers that we modify.
                 We need two scratch regs anyway.  */
              movq    %rsi, -16(%rsp)
              movq    %fs:DTV_OFFSET, %rsi

  Since the dtv field in TCB is a pointer, %fs:DTV_OFFSET is a 32-bit location, not 64-bit. Load the dtv field to RSI_LP instead of rsi. This fixes BZ #31184.
  (cherry picked from commit 3502440397bbb840e2f7223734aa5cc2cc0e29b6)
* x86_64: Fix asm constraints in feraiseexcept (bug 30305)Florian Weimer2023-04-241-2/+2
| | | | | | | | | | The divss instruction clobbers its first argument, and the constraints need to reflect that. Fortunately, with GCC 12, generated code does not actually change, so there is no externally visible bug. Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 5d1ccdda7b0c625751661d50977f3dfbc73f8eae)
* Regenerate ulps on x86_64 with GCC 12H.J. Lu2023-01-111-1/+1
| | | | | | | | | | | Fix FAIL: math/test-float-clog10 FAIL: math/test-float32-clog10 on Intel Core i7-1165G7 with GCC 12. (cherry picked from commit de8a0897e3c084dc93676e331b610f146000a0ab)
* x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]Noah Goldstein2022-11-241-5/+2
| | | | | | | | | | | | | Previous implementation was adjusting length (rsi) to match bytes (eax), but since there is no bound to length this can cause overflow. Fix is to just convert the byte-count (eax) to length by dividing by sizeof (wchar_t) before the comparison. Full check passes on x86-64 and build succeeds w/ and w/o multiarch. (cherry picked from commit b0969fa53a28b4ab2159806bf6c99a98999502ee)
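  The overflow described here can be sketched in C. This is a simplified model rather than the assembly from the patch: the function names, the direction of the comparison and the WCHAR_BYTES constant (4, as on x86-64 Linux) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* sizeof (wchar_t) on x86-64 Linux; hard-coded to keep the sketch portable.  */
#define WCHAR_BYTES 4u

/* Overflow-prone form: scale the unbounded caller-supplied length up to
   bytes.  For maxlen >= 2^62 the multiplication wraps modulo 2^64.  */
static inline bool
len_reached_buggy (uint64_t maxlen, uint32_t bytes_checked)
{
  return maxlen * WCHAR_BYTES <= bytes_checked;
}

/* Fixed form: scale the small, bounded byte count down to characters, so
   no arithmetic is ever done on the unbounded length.  */
static inline bool
len_reached_fixed (uint64_t maxlen, uint32_t bytes_checked)
{
  return maxlen <= bytes_checked / WCHAR_BYTES;
}
```

  For ordinary lengths both forms agree. For a huge length such as 2^62 the first form wraps to zero and wrongly reports that the limit was reached, while the second does not.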
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementationsAurelien Jarno2022-10-042-3/+15
| | | | | | | | | | | | The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk' instruction which belongs to the BMI1 CPU feature and the 'shrx' instruction, which belongs to the BMI2 CPU feature. Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 7e8283170c5d6805b609a040801d819e362a6292)
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementationAurelien Jarno2022-10-042-2/+9
| | | | | | | | | | | | The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which belongs to the LZCNT CPU feature. Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 3c0c78afabfed4b6fc161c159e628fbf14ff370b)
* x86-64: Require BMI2 for AVX2 (raw|w)memchr implementationsAurelien Jarno2022-10-041-3/+9
  The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi' and 'sarx' instructions, which belong to the BMI2 CPU feature.
  Fixes: acfd088a1963 ("x86: Optimize memchr-avx2.S")
  Partially resolves: BZ #29611
  Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit e3e7fab7fe5186d18ca2046d99ba321c27db30ad)
* x86-64: Require BMI2 for AVX2 wcs(n)cmp implementationsAurelien Jarno2022-10-041-2/+6
  The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.
  NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.
  Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
  Partially resolves: BZ #29611
  Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit f31a5a884ed84bd37032729d4d1eb9d06c9f3c29)
* x86-64: Require BMI2 for AVX2 strncmp implementationAurelien Jarno2022-10-042-4/+7
  The AVX2 strncmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.
  NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.
  Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
  Partially resolves: BZ #29611
  Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit fc7de1d9b99ae1676bc626ddca422d7abee0eb48)
* x86-64: Require BMI2 for AVX2 strcmp implementationAurelien Jarno2022-10-042-3/+5
  The AVX2 strcmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.
  NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.
  Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
  Partially resolves: BZ #29611
  Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 4d64c6445735e9b34e2ac8e369312cbfc2f88e17)
* x86-64: Require BMI2 for AVX2 str(n)casecmp implementationsAurelien Jarno2022-10-042-8/+21
  The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.
  NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.
  Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
  Partially resolves: BZ #29611
  Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
  (cherry picked from commit 10f79d3670b036925da63dc532b122d27ce65ff8)
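  The common idea behind this series of BMI fixes is that an implementation may only be selected when every instruction-set extension it uses is present. The sketch below shows that rule for a strcmp-like selector. The enum, the function name and the boolean parameters are hypothetical; glibc's real selectors read its internal CPU feature data, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical implementation identifiers, for illustration only.  */
enum strcmp_impl
{
  STRCMP_IMPL_SSE2,
  STRCMP_IMPL_AVX2
};

/* Pick the AVX2 variant only if BMI2 is also usable, because the AVX2
   code path executes 'bzhi'.  AVX2 support alone is not sufficient.  */
static inline enum strcmp_impl
select_strcmp_impl (bool has_avx2, bool has_bmi2)
{
  if (has_avx2 && has_bmi2)
    return STRCMP_IMPL_AVX2;
  return STRCMP_IMPL_SSE2;
}
```

  A CPU or virtual machine that reports AVX2 without BMI2 falls back to the baseline variant instead of faulting on an unsupported instruction.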
* nptl: Add backoff mechanism to spinlock loopWangyang Guo2022-09-281-0/+39
  When multiple threads are waiting for a lock at the same time, once the lock owner releases the lock, the waiters all see the lock as available and all try to take it, which may cause an expensive CAS storm.
  Binary exponential backoff with random jitter is introduced. As the number of try-lock attempts increases, it is more likely that a larger number of threads are competing for the adaptive mutex lock, so the wait time is increased exponentially. A random jitter is also added to avoid synchronous try-lock attempts from other threads.
  v2: Remove read-check before try-lock for performance.
  v3:
  1. Restore read-check since it works well on some platforms.
  2. Make backoff arch dependent, and enable it for x86_64.
  3. Limit max backoff to reduce latency in large critical sections.
  v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
  v5: Commit log updated for regression in large critical section.

  Result of pthread-mutex-locks bench
  Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
  First Row: thread number
  First Col: critical section length
  Values: backoff vs upstream, time based, lower is better

  non-critical-length: 1
          1     2     4     8    16    32    64   112   140
    0  0.99  0.58  0.52  0.49  0.43  0.44  0.46  0.52  0.54
    1  0.98  0.43  0.56  0.50  0.44  0.45  0.50  0.56  0.57
    2  0.99  0.41  0.57  0.51  0.45  0.47  0.48  0.60  0.61
    4  0.99  0.45  0.59  0.53  0.48  0.49  0.52  0.64  0.65
    8  1.00  0.66  0.71  0.63  0.56  0.59  0.66  0.72  0.71
   16  0.97  0.78  0.91  0.73  0.67  0.70  0.79  0.80  0.80
   32  0.95  1.17  0.98  0.87  0.82  0.86  0.89  0.90  0.90
   64  0.96  0.95  1.01  1.01  0.98  1.00  1.03  0.99  0.99
  128  0.99  1.01  1.01  1.17  1.08  1.12  1.02  0.97  1.02

  non-critical-length: 32
          1     2     4     8    16    32    64   112   140
    0  1.03  0.97  0.75  0.65  0.58  0.58  0.56  0.70  0.70
    1  0.94  0.95  0.76  0.65  0.58  0.58  0.61  0.71  0.72
    2  0.97  0.96  0.77  0.66  0.58  0.59  0.62  0.74  0.74
    4  0.99  0.96  0.78  0.66  0.60  0.61  0.66  0.76  0.77
    8  0.99  0.99  0.84  0.70  0.64  0.66  0.71  0.80  0.80
   16  0.98  0.97  0.95  0.76  0.70  0.73  0.81  0.85  0.84
   32  1.04  1.12  1.04  0.89  0.82  0.86  0.93  0.91  0.91
   64  0.99  1.15  1.07  1.00  0.99  1.01  1.05  0.99  0.99
  128  1.00  1.21  1.20  1.22  1.25  1.31  1.12  1.10  0.99

  non-critical-length: 128
          1     2     4     8    16    32    64   112   140
    0  1.02  1.00  0.99  0.67  0.61  0.61  0.61  0.74  0.73
    1  0.95  0.99  1.00  0.68  0.61  0.60  0.60  0.74  0.74
    2  1.00  1.04  1.00  0.68  0.59  0.61  0.65  0.76  0.76
    4  1.00  0.96  0.98  0.70  0.63  0.63  0.67  0.78  0.77
    8  1.01  1.02  0.89  0.73  0.65  0.67  0.71  0.81  0.80
   16  0.99  0.96  0.96  0.79  0.71  0.73  0.80  0.84  0.84
   32  0.99  0.95  1.05  0.89  0.84  0.85  0.94  0.92  0.91
   64  1.00  0.99  1.16  1.04  1.00  1.02  1.06  0.99  0.99
  128  1.00  1.06  0.98  1.14  1.39  1.26  1.08  1.02  0.98

  There is a regression in large critical sections. But the adaptive mutex is aimed at "quick" locks. Small critical sections are more common when users choose to use the adaptive pthread_mutex.
  Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 8162147872491bb5b48e91543b19c49a29ae6b6d)
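  The backoff rule described in this message can be sketched with two small helpers. This is a minimal illustration, not the code from the patch: the helper names and the MAX_BACKOFF cap of 64 are assumptions, and the jitter is passed in as a plain number so that the arithmetic is easy to follow.

```c
#include <assert.h>

/* Hypothetical upper bound on the backoff window; the patch limits the
   maximum backoff but the value used here is made up.  */
#define MAX_BACKOFF 64u

/* Number of spins to wait before the next try-lock attempt: the current
   window plus a jitter value reduced to the range [0, window).  WINDOW
   must be a non-zero power of two so that the mask is valid.  */
static inline unsigned int
backoff_spin_count (unsigned int window, unsigned int jitter)
{
  assert (window != 0 && (window & (window - 1)) == 0);
  return window + (jitter & (window - 1));
}

/* Binary exponential growth: double the window after each failed attempt
   until the cap is reached.  */
static inline unsigned int
backoff_next_window (unsigned int window)
{
  return window < MAX_BACKOFF ? window << 1 : window;
}
```

  Threads that fail at the same moment draw different jitter values and therefore retry at different times, which is what spreads out the compare-and-swap attempts.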
* x86: Add missing IS_IN (libc) check to strncmp-sse4_2.SNoah Goldstein2022-07-181-3/+5
  The check was missing, so for the multiarch build rtld-strncmp-sse4_2.os was being built and exporting symbols:

      build/glibc/string/rtld-strncmp-sse4_2.os:
      0000000000000000 T __strncmp_sse42

  Introduced in:

      commit 11ffcacb64a939c10cfc713746b8ec88837f5c4a
      Author: H.J. Lu <hjl.tools@gmail.com>
      Date: Wed Jun 21 12:10:50 2017 -0700

          x86-64: Implement strcmp family IFUNC selectors in C

  (cherry picked from commit 96ac447d915ea5ecef3f9168cc13f4e731349a3b)
* x86: Move mem{p}{mov|cpy}_{chk_}erms to its own fileNoah Goldstein2022-07-183-50/+73
| | | | | | | | The primary memmove_{impl}_unaligned_erms implementations don't interact with this function. Putting them in same file both wastes space and unnecessarily bloats a hot code section. (cherry picked from commit 21925f64730d52eb7d8b2fb62b412f8ab92b0caf)
* x86: Move and slightly improve memset_ermsNoah Goldstein2022-07-183-31/+45
| | | | | | | | | | | | | | | | Implementation wise: 1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not use the L(stosb) label that was previously defined. 2. Don't give the hotpath (fallthrough) to zero size. Code positioning wise: Move memset_{chk}_erms to its own file. Leaving it in between the memset_{impl}_unaligned both adds unnecessary complexity to the file and wastes space in a relatively hot cache section. (cherry picked from commit 4a3f29e7e475dd4e7cce2a24c187e6fb7b5b0a05)
* x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl listNoah Goldstein2022-07-181-0/+4
| | | | | | This was simply missing and meant we weren't testing it properly. (cherry picked from commit 2a1099020cdc1e4c9c928156aa85c8cf9d540291)
* x86: Put wcs{n}len-sse4.1 in the sse4.1 text sectionNoah Goldstein2022-07-183-1/+7
| | | | | | | Previously was missing but the two implementations shouldn't get in the sse2 (generic) text section. (cherry picked from commit afc6e4328ff80973bde50d5401691b4c4b2e522c)
* x86: Align entry for memrchr to 64-bytes.Noah Goldstein2022-07-181-1/+1
  The function was tuned around 64-byte entry alignment and performs better for all sizes with it. As well, different code paths were explicitly written to touch the minimum number of cache lines, i.e. sizes <= 32 touch only the entry cache line.
  (cherry picked from commit 227afaa67213efcdce6a870ef5086200f1076438)
* x86: Cleanup bounds checking in large memcpy caseNoah Goldstein2022-07-181-8/+21
| | | | | | | | | | | | | | 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x). Previously was using `__x86_rep_movsb_threshold` and should have been using `__x86_shared_non_temporal_threshold`. 2. Avoid reloading __x86_shared_non_temporal_threshold before the L(large_memcpy_4x) bounds check. 3. Document the second bounds check for L(large_memcpy_4x) more clearly. (cherry picked from commit 89a25c6f64746732b87eaf433af0964b564d4a92)
* x86: Add sse42 implementation to strcmp's ifuncNoah Goldstein2022-07-181-0/+5
  This has been missing since the ifuncs were added. The performance of SSE4.2 is preferable to SSE2.
  Measured on Tigerlake with N = 20 runs.
  Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
  (cherry picked from commit ff439c47173565fbff4f0f78d07b0f14e4a7db05)
* x86: Align varshift table to 32-bytesNoah Goldstein2022-07-182-3/+5
| | | | | | This ensures the load will never split a cache line. (cherry picked from commit 0f91811333f23b61cf681cab2704b35a0a073b97)
* x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactionsNoah Goldstein2022-07-181-3/+3
| | | | | | | | | | Give fall-through path to `vzeroupper` and taken-path to `vzeroall`. Generally even on machines with RTM the expectation is the string-library functions will not be called in transactions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit c28db9cb29a7d6cf3ce08fd8445e6b7dea03f35b)
* x86: Shrink code size of memchr-evex.SNoah Goldstein2022-07-181-21/+25
  This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily.
  The total code size saving is: 64 bytes
  There are no non-negligible changes in the benchmarks.
  Geometric Mean of all benchmarks New / Old: 1.000
  Full xcheck passes on x86_64.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 56da3fe1dd075285fa8186d44b3c28e68c687e62)
* x86: Shrink code size of memchr-avx2.SNoah Goldstein2022-07-182-50/+60
  This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily.
  The total code size saving is: 59 bytes
  There are no major changes in the benchmarks.
  Geometric Mean of all benchmarks New / Old: 0.967
  Full xcheck passes on x86_64.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35)

  x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]

      commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35
      Author: Noah Goldstein <goldstein.w.n@gmail.com>
      Date: Mon Jun 6 21:11:33 2022 -0700

          x86: Shrink code size of memchr-avx2.S

  Changed how the page cross case aligned string (rdi) in rawmemchr. This was incompatible with how `L(cross_page_continue)` expected the pointer to be aligned and would cause rawmemchr to read data that started before the beginning of the string. What it would read was in valid memory but could count CHAR matches, resulting in an incorrect return value.
  This commit fixes that issue by essentially reverting the changes to the L(page_cross) case as they didn't really matter.
  Test cases added and all pass with the new code (and were confirmed to fail with the old code).
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 2c9af8421d2b4a7fcce163e7bc81a118d22fd346)
* x86: Optimize memrchr-avx2.SNoah Goldstein2022-07-182-278/+257
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 15-45% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit af5306a735eb0966fdc2f8ccdafa8888e2df0c87)
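  The message does not spell out the exact instruction sequence, so the following is only a sketch of why a leading-zero count suits a reverse search. It assumes a 32-bit match mask in which bit i is set when byte i of a 32-byte vector equals the search character, and it uses the GCC/Clang builtin `__builtin_clz` as a stand-in for `lzcnt`. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the highest set bit in MASK, i.e. the position of
   the last matching byte in the vector.  MASK must be non-zero because
   __builtin_clz is undefined for zero (the hardware lzcnt instruction,
   by contrast, returns the operand width for a zero input).  */
static inline unsigned int
last_match_index (uint32_t mask)
{
  assert (mask != 0);
  return 31u - (unsigned int) __builtin_clz (mask);
}
```

  A single count gives the last match directly, without looping over the mask bits.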
* x86: Optimize memrchr-evex.SNoah Goldstein2022-07-181-271/+268
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller user-arg lengths more. 2. optimizes target placement more carefully 3. reuses logic more 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions. Particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input). This case has roughly a 35% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit b4209615a06b01c974f47b4998b00e4c7b1aa5d9)
* x86: Optimize memrchr-sse2.SNoah Goldstein2022-07-181-321/+292
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and generally performance improvement in the other far more common cases. 2. There are some regressions 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 731feee3869550e93177e604604c1765d81de571)
* x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret`Noah Goldstein2022-07-182-0/+19
  The RTM vzeroupper mitigation has no way of replacing an inline vzeroupper that is not directly before a return.
  This can be useful when hoisting a vzeroupper to save code size, for example:
  ```
  L(foo):
          cmpl    %eax, %edx
          jz      L(bar)
          tzcntl  %eax, %eax
          addq    %rdi, %rax
          VZEROUPPER_RETURN

  L(bar):
          xorl    %eax, %eax
          VZEROUPPER_RETURN
  ```
  Can become:
  ```
  L(foo):
          COND_VZEROUPPER
          cmpl    %eax, %edx
          jz      L(bar)
          tzcntl  %eax, %eax
          addq    %rdi, %rax
          ret

  L(bar):
          xorl    %eax, %eax
          ret
  ```
  This code does not change any existing functionality.
  There is no difference in the objdump of libc.so before and after this patch.
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit dd5c483b2598f411428df4d8864c15c4b8a3cd68)
* x86: Create header for VEC classes in x86 strings libraryNoah Goldstein2022-07-187-0/+327
| | | | | | | | | | | | This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 8a780a6b910023e71f3173f37f0793834c047554)
* x86_64: Add strstr function with 512-bit EVEXRaghuveer Devulapalli2022-07-184-4/+246
  Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
  (1) We spend a few cycles at the beginning to peek into the needle. We locate an edge in the needle (first occurrence of 2 consecutive distinct characters) and also store the first 64 bytes into a zmm register.
  (2) We search for the edge in the haystack by looking into one cache line of the haystack at a time. This avoids having to read past a page boundary, which can cause a seg fault.
  (3) If an edge is found in the haystack we first compare the first 64 bytes of the needle (already stored in a zmm register) before we proceed with a full string compare performed byte by byte.

  Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
  Geometric mean of all benchmarks: new / old = 0.66
  Difficult skiptable(0)    : new / old = 0.02
  Difficult skiptable(1)    : new / old = 0.01
  Difficult 2-way           : new / old = 0.25
  Difficult testing first 2 : new / old = 1.26
  Difficult skiptable(0)    : new / old = 0.05
  Difficult skiptable(1)    : new / old = 0.06
  Difficult 2-way           : new / old = 0.26
  Difficult testing first 2 : new / old = 1.05
  Difficult skiptable(0)    : new / old = 0.42
  Difficult skiptable(1)    : new / old = 0.24
  Difficult 2-way           : new / old = 0.21
  Difficult testing first 2 : new / old = 1.04
  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 5082a287d5e9a1f9cb98b7c982a708a3684f1d5c)

  x86: Remove __mmask intrinsics in strstr-avx512.c

  The intrinsics are not available before GCC7 and using standard operators generates code of equivalent or better quality.
  Removed:
      _cvtmask64_u64
      _kshiftri_mask64
      _kand_mask64
  Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958
  (cherry picked from commit f2698954ff9c2f9626d4bcb5a30eb5729714e0b0)
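  Step (1) above, locating an edge in the needle, can be sketched in scalar C. The function name and the choice to return 0 when the needle has no edge are assumptions for illustration; the vectorized search in steps (2) and (3) is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Return the first index i such that needle[i] and needle[i + 1] are both
   inside the string and differ.  If the needle is empty, has one
   character, or consists of a single repeated character, there is no edge
   and 0 is returned (an assumption of this sketch).  */
static inline size_t
find_needle_edge (const char *needle)
{
  size_t i = 0;
  /* needle[i + 1] is only read after needle[i] is known to be non-NUL,
     so the scan never reads past the terminator.  */
  while (needle[i] != '\0' && needle[i + 1] != '\0')
    {
      if (needle[i] != needle[i + 1])
        return i;
      i++;
    }
  return 0;
}
```

  Searching for a pair of distinct characters gives fewer false candidates than searching for one character, for instance when the needle starts with a long run of the same byte.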
* x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOTH.J. Lu2022-07-181-2/+4
| | | | | | | | | According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we can ignore their r_addends. Reviewed-by: Fangrui Song <maskray@google.com> (cherry picked from commit f8587a61892cbafd98ce599131bf4f103466f084)
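  The rule from the psABI can be sketched as a small relocation-value helper. The enum names and the function are hypothetical and do not reflect the dynamic linker's actual code; the numeric relocation types (1, 6 and 7) are the psABI values for R_X86_64_64, R_X86_64_GLOB_DAT and R_X86_64_JUMP_SLOT.

```c
#include <assert.h>
#include <stdint.h>

/* Relocation type numbers as defined by the x86-64 psABI, declared
   locally so that the sketch does not depend on <elf.h>.  */
enum
{
  SKETCH_R_X86_64_64 = 1,
  SKETCH_R_X86_64_GLOB_DAT = 6,
  SKETCH_R_X86_64_JUMP_SLOT = 7
};

/* Compute the value stored by a relocation of the given TYPE.  Only the
   three types named above are modeled; any other type yields 0.  */
static inline uint64_t
sketch_reloc_value (unsigned int type, uint64_t sym_value, int64_t addend)
{
  switch (type)
    {
    case SKETCH_R_X86_64_GLOB_DAT:
    case SKETCH_R_X86_64_JUMP_SLOT:
      /* The psABI defines these as the symbol value: r_addend is ignored.  */
      return sym_value;
    case SKETCH_R_X86_64_64:
      /* Symbol value plus addend.  */
      return sym_value + (uint64_t) addend;
    default:
      return 0;
    }
}
```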
* x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlenSunil K Pandey2022-07-187-0/+346
| | | | | | | | | | | | | | | | This patch implements following evex512 version of string functions. Perf gain for evex512 version is up to 50% as compared to evex, depending on length and alignment. Placeholder function, not used by any processor at the moment. - String length function using 512 bit vectors. - String N length using 512 bit vectors. - Wide string length using 512 bit vectors. - Wide string N length using 512 bit vectors. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit 9c66efb86fe384f77435f7e326333fb2e4e10676)
* x86_64: Remove bzero optimizationAdhemerval Zanella2022-07-1811-238/+3
  Both symbols are marked as legacy in POSIX.1-2001 and removed in POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE or _DEFAULT_SOURCE.
  GCC also replaces bcopy with memmove and bzero with memset in its default configuration (to actually get a bzero libc call the code must omit the string.h inclusion and be built with -fno-builtin), so it is highly unlikely that programs are actually calling the libc bzero symbol.
  On a recent Linux distro (Ubuntu 22.04), there are no bzero calls by the installed binaries.

      $ cat count_bstring.sh
      #!/bin/bash

      files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
      total=0
      for file in $files; do
        symbols=`objdump -R $file 2>&1`
        if [ $? -eq 0 ]; then
          ncalls=`echo $symbols | grep -w $1 | wc -l`
          ((total=total+ncalls))
          if [ $ncalls -gt 0 ]; then
            echo "$file: $ncalls"
          fi
        fi
      done
      echo "TOTAL=$total"
      $ ./count_bstring.sh bzero
      TOTAL=0

  Checked on x86_64-linux-gnu.
  (cherry picked from commit 9403b71ae97e3f1a91c796ddcbb4e6f044434734)
* x86_64: Remove end of line trailing spacesSunil K Pandey2022-07-181-2/+2
  This commit removes trailing spaces introduced by the following commit.

      commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb
      Author: Noah Goldstein <goldstein.w.n@gmail.com>
      Date: Wed Jun 23 01:56:29 2021 -0400

          x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]
* x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #29127]Noah Goldstein2022-05-251-6/+2
  Re-cherry-pick commit c627209832 for the strcmp-avx2.S change which was omitted in the initial cherry pick because at the time this bug was not present on the release branch.
  Fixes BZ #29127.
  In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would call strcmp-avx2 and wcscmp-avx2 respectively. These have no checks around vzeroupper and would trigger spurious aborts. This commit fixes that.
  test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on AVX2 machines with and without RTM.
  Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)
* x86: Optimize {str|wcs}rchr-evexNoah Goldstein2022-05-161-181/+290
| | | | | | | | | | | | | The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.755 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit c966099cdc3e0fdf92f63eac09b22fa7e5f5f02d)
* x86: Optimize {str|wcs}rchr-avx2Noah Goldstein2022-05-161-157/+269
| | | | | | | | | | | | | The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.832 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit df7e295d18ffa34f629578c0017a9881af7620f6)
* x86: Optimize {str|wcs}rchr-sse2Noah Goldstein2022-05-164-443/+338
| | | | | | | | | | | | | The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.741 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit 5307aa9c1800f36a64c183c091c9af392c1fa75c)
* x86: Cleanup page cross code in memcmp-avx2-movbe.SNoah Goldstein2022-05-161-37/+61
  The old code was both inefficient and wasted code size. The new code is 62 bytes smaller and has comparable or better performance in the page cross case.
  geometric_mean(N=20) of page cross cases New / Original: 0.960

  size, align0, align1, ret, New Time/Old Time
  1, 4095, 0, 0, 1.001
  1, 4095, 0, 1, 0.999
  1, 4095, 0, -1, 1.0
  2, 4094, 0, 0, 1.0
  2, 4094, 0, 1, 1.0
  2, 4094, 0, -1, 1.0
  3, 4093, 0, 0, 1.0
  3, 4093, 0, 1, 1.0
  3, 4093, 0, -1, 1.0
  4, 4092, 0, 0, 0.987
  4, 4092, 0, 1, 1.0
  4, 4092, 0, -1, 1.0
  5, 4091, 0, 0, 0.984
  5, 4091, 0, 1, 1.002
  5, 4091, 0, -1, 1.005
  6, 4090, 0, 0, 0.993
  6, 4090, 0, 1, 1.001
  6, 4090, 0, -1, 1.003
  7, 4089, 0, 0, 0.991
  7, 4089, 0, 1, 1.0
  7, 4089, 0, -1, 1.001
  8, 4088, 0, 0, 0.875
  8, 4088, 0, 1, 0.881
  8, 4088, 0, -1, 0.888
  9, 4087, 0, 0, 0.872
  9, 4087, 0, 1, 0.879
  9, 4087, 0, -1, 0.883
  10, 4086, 0, 0, 0.878
  10, 4086, 0, 1, 0.886
  10, 4086, 0, -1, 0.873
  11, 4085, 0, 0, 0.878
  11, 4085, 0, 1, 0.881
  11, 4085, 0, -1, 0.879
  12, 4084, 0, 0, 0.873
  12, 4084, 0, 1, 0.889
  12, 4084, 0, -1, 0.875
  13, 4083, 0, 0, 0.873
  13, 4083, 0, 1, 0.863
  13, 4083, 0, -1, 0.863
  14, 4082, 0, 0, 0.838
  14, 4082, 0, 1, 0.869
  14, 4082, 0, -1, 0.877
  15, 4081, 0, 0, 0.841
  15, 4081, 0, 1, 0.869
  15, 4081, 0, -1, 0.876
  16, 4080, 0, 0, 0.988
  16, 4080, 0, 1, 0.99
  16, 4080, 0, -1, 0.989
  17, 4079, 0, 0, 0.978
  17, 4079, 0, 1, 0.981
  17, 4079, 0, -1, 0.98
  18, 4078, 0, 0, 0.981
  18, 4078, 0, 1, 0.98
  18, 4078, 0, -1, 0.985
  19, 4077, 0, 0, 0.977
  19, 4077, 0, 1, 0.979
  19, 4077, 0, -1, 0.986
  20, 4076, 0, 0, 0.977
  20, 4076, 0, 1, 0.986
  20, 4076, 0, -1, 0.984
  21, 4075, 0, 0, 0.977
  21, 4075, 0, 1, 0.983
  21, 4075, 0, -1, 0.988
  22, 4074, 0, 0, 0.983
  22, 4074, 0, 1, 0.994
  22, 4074, 0, -1, 0.993
  23, 4073, 0, 0, 0.98
  23, 4073, 0, 1, 0.992
  23, 4073, 0, -1, 0.995
  24, 4072, 0, 0, 0.989
  24, 4072, 0, 1, 0.989
  24, 4072, 0, -1, 0.991
  25, 4071, 0, 0, 0.99
  25, 4071, 0, 1, 0.999
  25, 4071, 0, -1, 0.996
  26, 4070, 0, 0, 0.993
  26, 4070, 0, 1, 0.995
  26, 4070, 0, -1, 0.998
  27, 4069, 0, 0, 0.993
  27, 4069, 0, 1, 0.999
  27, 4069, 0, -1, 1.0
  28, 4068, 0, 0, 0.997
  28, 4068, 0, 1, 1.0
  28, 4068, 0, -1, 0.999
  29, 4067, 0, 0, 0.996
  29, 4067, 0, 1, 0.999
  29, 4067, 0, -1, 0.999
  30, 4066, 0, 0, 0.991
  30, 4066, 0, 1, 1.001
  30, 4066, 0, -1, 0.999
  31, 4065, 0, 0, 0.988
  31, 4065, 0, 1, 0.998
  31, 4065, 0, -1, 0.998

  Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
  (cherry picked from commit 23102686ec67b856a2d4fd25ddaa1c0b8d175c4f)
* x86: Remove memcmp-sse4.S  Noah Goldstein  2022-05-16  4  -814/+0
    Code didn't actually use any sse4 instructions since `ptest` was
    removed in:

        commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
        Author: Noah Goldstein <goldstein.w.n@gmail.com>
        Date:   Wed Nov 10 16:18:56 2021 -0600

            x86: Shrink memcmp-sse4.S code size

    The new memcmp-sse2 implementation is also faster.

    geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

    Note there are two regressions preferring SSE2 for Size = 1 and
    Size = 65.

    Size = 1:
    size, align0, align1, ret, New Time/Old Time
       1,      1,      1,   0,               1.2
       1,      1,      1,   1,             1.197
       1,      1,      1,  -1,               1.2

    This is intentional. Size == 1 is significantly less hot based on
    profiles of GCC11 and Python3 than sizes [4, 8] (which are made
    hotter).

    Python3 Size = 1        -> 13.64%
    Python3 Size = [4, 8]   -> 60.92%

    GCC11   Size = 1        ->  1.29%
    GCC11   Size = [4, 8]   -> 33.86%

    size, align0, align1, ret, New Time/Old Time
       4,      4,      4,   0,             0.622
       4,      4,      4,   1,             0.797
       4,      4,      4,  -1,             0.805
       5,      5,      5,   0,             0.623
       5,      5,      5,   1,             0.777
       5,      5,      5,  -1,             0.802
       6,      6,      6,   0,             0.625
       6,      6,      6,   1,             0.813
       6,      6,      6,  -1,             0.788
       7,      7,      7,   0,             0.625
       7,      7,      7,   1,             0.799
       7,      7,      7,  -1,             0.795
       8,      8,      8,   0,             0.625
       8,      8,      8,   1,             0.848
       8,      8,      8,  -1,             0.914
       9,      9,      9,   0,             0.625

    Size = 65:
    size, align0, align1, ret, New Time/Old Time
      65,      0,      0,   0,             1.103
      65,      0,      0,   1,             1.216
      65,      0,      0,  -1,             1.227
      65,     65,      0,   0,             1.091
      65,      0,     65,   1,              1.19
      65,     65,     65,  -1,             1.215

    This is because A) the checks in range [65, 96] are now unrolled 2x
    and B) because smaller values <= 16 are now given a hotter path. By
    contrast the SSE4 version has a branch for Size = 80. The unrolled
    version gets better performance for returns which need both
    comparisons.

    size, align0, align1, ret, New Time/Old Time
     128,      4,      8,   0,             0.858
     128,      4,      8,   1,             0.879
     128,      4,      8,  -1,             0.888

    Also, outside of microbenchmark environments, where the branch is
    not fully predictable, it will have a real cost.

    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)
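The benchmark figure above is reported for "page cross cases". As a rough illustration of what that condition means, the sketch below checks whether a load of a given length starting at an address would run past the end of its page. The 4 KiB page size and the names `SKETCH_PAGE_SIZE` and `load_crosses_page` are assumptions made for this example; none of this is taken from memcmp-sse2.S.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical helper: returns 1 if reading `len` bytes starting at `addr`
   would touch the following page, 0 otherwise.
   `len` is assumed to be in the range [1, SKETCH_PAGE_SIZE]. */
int load_crosses_page(uintptr_t addr, size_t len)
{
    /* Offset of the first byte within its page. */
    size_t offset = (size_t)(addr & (SKETCH_PAGE_SIZE - 1));

    /* The last byte read is at offset + len - 1; it lands on the next page
       exactly when offset > SKETCH_PAGE_SIZE - len. */
    return offset > SKETCH_PAGE_SIZE - len;
}
```

With a 16-byte vector, `load_crosses_page(4080, 16)` is 0 because the load ends on the last byte of the page, while `load_crosses_page(4081, 16)` is 1.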
* x86: Small improvements for wcslen  Noah Goldstein  2022-05-16  1  -45/+41
    Just a few QOL changes.
        1. Prefer `add` over `lea` as it can run on more execution
           units.
        2. Don't break macro-fusion between `test` and `jcc`.
        3. Reduce code size by removing gratuitous padding bytes
           (-90 bytes).

    geometric_mean(N=20) of all benchmarks New / Original: 0.959

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 244b415d386487521882debb845a040a4758cb18)
* x86: Remove AVX str{n}casecmp  Noah Goldstein  2022-05-16  6  -197/+105
    The rationale is:

    1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
       regression on Tigerlake using SSE42 versus AVX across the
       benchtest suite).
    2. AVX2 version covers the majority of targets that previously
       preferred it.
    3. The targets where AVX would still be best (SnB and IVB) are
       becoming outdated.

    All in all, the code size saving is worth it.

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 305769b2a15c2e96f9e1b5195d3c4e0d6f0f4b68)
* x86: Add EVEX optimized str{n}casecmp  Noah Goldstein  2022-05-16  6  -40/+321
    geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 84e7c46df4086873eae28a1fb87d2cf5388b1e16)
* x86: Add AVX2 optimized str{n}casecmp  Noah Goldstein  2022-05-16  8  -31/+331
    geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit bbf81222343fed5cd704001a2ae0d86c71544151)
* x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S  Noah Goldstein  2022-05-16  1  -48/+35
    Slightly faster method of doing TOLOWER that saves an
    instruction.

    Also replace the hard coded 5-byte nop with .p2align 4. On builds
    with CET enabled this misaligned entry to strcasecmp.

    geometric_mean(N=40) of all benchmarks New / Original: .920

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit d154758e618ec9324f5d339c46db0aa27e8b1226)
* x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S  Noah Goldstein  2022-05-16  1  -35/+29
    Slightly faster method of doing TOLOWER that saves an
    instruction.

    Also replace the hard coded 5-byte nop with .p2align 4. On builds
    with CET enabled this misaligned entry to strcasecmp.

    geometric_mean(N=40) of all benchmarks New / Original: .894

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 670b54bc585ea4a94f3b2e9272ba44aa6b730b73)
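The two TOLOWER commits above do not spell out the method they use. For readers unfamiliar with the operation itself, here is a minimal scalar sketch of a branchless ASCII lowercase conversion. It assumes plain ASCII input and uses a hypothetical helper name, `ascii_tolower`; it is not the vectorized sequence from strcmp.S or strcmp-sse42.S.

```c
/* Hypothetical helper: lowercase one ASCII byte without branching.
   Bytes outside 'A'..'Z' are returned unchanged. */
unsigned char ascii_tolower(unsigned char c)
{
    /* (c - 'A') wraps to a large value for c < 'A' once narrowed back to
       unsigned char, so a single comparison covers both range bounds. */
    unsigned char is_upper = (unsigned char)(c - 'A') < 26;

    /* Upper and lower case ASCII letters differ only in bit 5 (0x20). */
    return (unsigned char)(c | (is_upper << 5));
}
```

A case-insensitive comparison would apply this to both input bytes before comparing them; the SIMD versions do the equivalent on a whole vector of bytes at once.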
* x86: Remove strspn-sse2.S and use the generic implementation  Noah Goldstein  2022-05-16  2  -118/+3
    The generic implementation is faster.

    geometric_mean(N=20) of all benchmarks New / Original: .710

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 9c8a6ad620b49a27120ecdd7049c26bf05900397)
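To make the comparison concrete, a generic C strspn is commonly built around a 256-entry lookup table. The sketch below shows that idea under a hypothetical name, `table_strspn`; it is a simplified illustration and should not be read as the exact code in glibc's generic implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the initial segment of `s` consisting
   only of bytes that appear in `accept`. */
size_t table_strspn(const char *s, const char *accept)
{
    /* One flag per possible byte value; 1 means "byte is in accept". */
    unsigned char table[256];
    memset(table, 0, sizeof table);

    for (const unsigned char *a = (const unsigned char *)accept; *a != 0; ++a)
        table[*a] = 1;

    /* table[0] is always 0, so the scan stops at the terminating NUL. */
    const unsigned char *p = (const unsigned char *)s;
    while (table[*p])
        ++p;

    return (size_t)(p - (const unsigned char *)s);
}
```

For example, `table_strspn("abcde", "abc")` returns 3. Building the table costs a fixed amount up front, after which each input byte needs only one table load.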
* x86: Remove strpbrk-sse2.S and use the generic implementation  Noah Goldstein  2022-05-16  2  -7/+3
    The generic implementation is faster (see strcspn commit).

    All string/memory tests pass.
    Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
    (cherry picked from commit 653358535280a599382cb6c77538a187dac6a87f)
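The reference to strcspn makes sense because strpbrk can be expressed directly in terms of it. A minimal sketch, using the hypothetical name `strpbrk_via_strcspn` and the standard `strcspn` from `<string.h>`:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pointer to the first byte in `s` that also occurs
   in `accept`, or NULL if there is none. */
char *strpbrk_via_strcspn(const char *s, const char *accept)
{
    /* strcspn returns the length of the prefix containing no byte from
       `accept`, so s + that length is either a match or the terminator. */
    s += strcspn(s, accept);
    return *s != '\0' ? (char *)s : NULL;
}
```

Under this formulation any speedup in strcspn carries over to strpbrk with no separate assembly to maintain.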