Commit message | Author | Age | Files | Lines
* x86: Replace all sse instructions with vex equivalent in avx+ filesNoah Goldstein2022-06-2275-158/+158
| | | | | | | | | | | | | Most of these don't really matter as there was no dirty upper state, but we should generally avoid stray sse when it's not needed. The one case that really matters is in svml_d_tanh4_core_avx2.S, where `blendvps %xmm0, %xmm8, %xmm7` was executed when there was a dirty upper state. Tested on x86_64-linux
* x86: Add support for compiling {raw|w}memchr with high ISA levelNoah Goldstein2022-06-2217-604/+720
1. Refactor files so that all implementations live in the multiarch directory.
   - Essentially moved the sse2 {raw|w}memchr.S implementation to multiarch/{raw|w}memchr-sse2.S.
   - The non-multiarch {raw|w}memchr.S file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds; otherwise we go through the ifunc selector).
2. Add ISA level build guards to different implementations.
   - I.e. memchr-avx2.S, which is ISA level 3, will only build if the compiled ISA level is <= 3. Otherwise there is no reason to include it, as we will always use one of the ISA level 4 implementations (memchr-evex{-rtm}.S).
3. Add new multiarch/rtld-{raw}memchr.S that just includes the non-multiarch {raw}memchr.S, which will in turn select the best implementation based on the compiled ISA level.
4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guaranteed replacement) to be skipped.
   - Guaranteed replacement essentially means that for any ISA level build there must be a function that the baseline of the ISA supports. So for {raw|w}memchr.S, since there is no ISA level 2 function, the ISA level 2 build still includes the ISA level 1 (sse2) function. Once we reach the ISA level 3 build, however, {raw|w}memchr-avx2{-rtm}.S will always be sufficient, so the ISA level 1 implementation ({raw|w}memchr-sse2.S) will not be built.

Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4}, and on m32 with and without multiarch.
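The build-guard rule described in this commit can be sketched in C. This is only an illustration under stated assumptions: `should_build` and the `GUARD_*` constants are hypothetical names, not glibc's macros, and the guard values simply encode the text above (the sse2 version is still needed at ISA level 2 because no level 2 implementation exists).

```c
#include <stdbool.h>

/* Hypothetical guard levels: the highest compiled ISA level at which each
   implementation is still needed.  sse2 is written for level 1 but is the
   guaranteed replacement for level 2 builds as well.  */
enum
{
  GUARD_SSE2 = 2,
  GUARD_AVX2 = 3,
  GUARD_EVEX = 4
};

/* An implementation is built only while the compiled ISA level does not
   exceed its guard level; above that, a better implementation is always
   available and the lower one can be skipped.  */
static bool
should_build (int build_isa_level, int guard_level)
{
  return build_isa_level <= guard_level;
}
```

With these assumed values, a level 2 build keeps the sse2 version, while a level 3 build drops it and keeps the avx2 and evex versions.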
* x86: Add defines / utilities for making ISA specific x86 buildsNoah Goldstein2022-06-225-13/+229
| | | | | | | | | | | | | | | 1. Factor out some of the ISA level defines in isa-level.c to standalone header isa-level.h 2. Add new headers with ISA level dependent macros for handling ifuncs. Note, this patch does not change any code. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* stdlib: Remove attr_write from mbstowcs if dst is NULL [BZ: 29265]Noah Goldstein2022-06-223-5/+21
| | | | | | | | mbstowcs is defined when dst is NULL, and is defined to be special-cased in that case, so the fortify objsize check is incorrect there. Tested on x86-64 linux. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* stdlib: Remove trailing whitespace from MakefileNoah Goldstein2022-06-221-1/+1
| | | | | This causes precommit tests to fail when pushing commits that modify this file.
* debug: make __read_chk a cancellation point (bug 29274)Andreas Schwab2022-06-223-10/+57
| | | | | The __read_chk function, as the implementation behind the fortified read function, must be a cancellation point, thus it cannot use INLINE_SYSCALL.
* s390: use LC_ALL=C for readelf callSam James2022-06-212-2/+2
| | | | | | | | | | Let's use LC_ALL=C as we do elsewhere for consistency. Tested on s390x-ibm-linux-gnu. See: 72bd208846535725ea28b8173e79ef60e57a968c Signed-off-by: Sam James <sam@gentoo.org> Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
* s390: use $READELFSam James2022-06-212-2/+2
| | | | | | | | | | We already check for it in root configure.ac with AC_CHECK_TOOL. Let's use the result. Tested on s390x-ibm-linux-gnu. Signed-off-by: Sam James <sam@gentoo.org> Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
* i386: Fix include paths for strspn, strcspn, and strpbrkNoah Goldstein2022-06-173-6/+6
| | | | | | | | | | | | | | commit c22eb807b0c8125101f6a274795425be2bbd0386 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Thu Jun 16 15:07:12 2022 -0700 x86: Rename generic functions with unique postfix for clarity Changed the names of the strspn-c, strcspn-c, and strpbrk-c files in a general refactor. It didn't change the include paths for the i386 files breaking the i386 build. This commit fixes that. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
* elf: Silence GCC 11/12 false positive warningH.J. Lu2022-06-171-0/+10
| | | | | | | | | | Silence GCC 11/12 false positive warning with -mavx512f on dl-load.c: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106008 $ gcc -O2 -fPIC -march=x86-64 -mavx512f -S -Wall ... dl-load.c: In function ‘_dl_map_object_from_fd.constprop’: dl-load.c:1158:30: warning: ‘(((char *)loadcmds.113_68 + _933 + 16))[329406144173384849].mapend’ may be used uninitialized [-Wmaybe-uninitialized]
* x86: Rename generic functions with unique postfix for clarityNoah Goldstein2022-06-1629-76/+190
| | | | | | | | | | No functions are changed. It just renames generic implementations from '{func}_sse2' to '{func}_generic'. This is just because the postfix "_sse2" was overloaded and was used for files that had hand-optimized sse2 assembly implementations and files that just redirected back to the generic implementation. Full xcheck passed on x86_64.
* x86: Add BMI1/BMI2 checks for ISA_V3 checkNoah Goldstein2022-06-161-1/+2
| | | | | | | BMI1/BMI2 are part of the ISA V3 requirements: https://en.wikipedia.org/wiki/X86-64 And defined by GCC when building with `-march=x86-64-v3`
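As a rough illustration of what such a check looks like from the compiler's side, the sketch below tests GCC's predefined feature macros. The function name `compiled_for_isa_v3` is hypothetical, and the set of macros tested is an assumption based on the public x86-64-v3 feature list, not a copy of glibc's actual check.

```c
/* Returns 1 when the compiler's predefined macros indicate the features
   usually associated with x86-64-v3 (including BMI1 and BMI2), else 0.
   GCC and Clang define these macros for -march=x86-64-v3.  */
static int
compiled_for_isa_v3 (void)
{
#if defined __AVX2__ && defined __BMI__ && defined __BMI2__ \
    && defined __FMA__ && defined __LZCNT__ && defined __MOVBE__ \
    && defined __F16C__
  return 1;
#else
  return 0;
#endif
}
```

Built with plain `-march=x86-64` this returns 0; leaving out the `__BMI__` and `__BMI2__` tests would not change that result for the standard `-march` levels, but would accept hand-picked flag combinations that lack BMI.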
* x86-64: Handle fewer relocation types for RTLD_BOOTSTRAPFangrui Song2022-06-161-26/+6
| | | | | | | | | | The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. RELATIVE has been handled (by _ELF_DYNAMIC_DO_RELOC due to DT_RELACOUNT, or RELR), so the switch statement only needs to handle GLOB_DAT and JUMP_SLOT. We can drop these `#if[n]def RTLD_BOOTSTRAP` and add a large `# ifndef RTLD_BOOTSTRAP` instead.
* aarch64: Handle fewer relocations for RTLD_BOOTSTRAPFangrui Song2022-06-151-18/+15
| | | | | | | | | The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. TLSDESC/TLS_DTPMOD/TLS_DTPREL handling can be removed. Remove `case AARCH64_R(RELATIVE)` as well, since elf_machine_rela has already checked it. Tested on aarch64-linux-gnu.
* riscv: Change the relocations handled for RTLD_BOOTSTRAPFangrui Song2022-06-151-13/+10
| | | | | | | | | | | | The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only needs to handle RELATIVE, GLOB_DAT, and the symbolic relocation type (R_RISCV_{32,64}). NONE and IRELATIVE can be removed. The code relies on ld.so having DT_RELACOUNT so that the RTLD_BOOTSTRAP branch does not need to handle RELATIVE. Drop this minor size optimization for clarity. Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
* x86: Cleanup bounds checking in large memcpy caseNoah Goldstein2022-06-151-8/+21
| | | | | | | | | | | | 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x). Previously was using `__x86_rep_movsb_threshold` and should have been using `__x86_shared_non_temporal_threshold`. 2. Avoid reloading __x86_shared_non_temporal_threshold before the L(large_memcpy_4x) bounds check. 3. Document the second bounds check for L(large_memcpy_4x) more clearly.
* x86: Add bounds `x86_non_temporal_threshold`Noah Goldstein2022-06-152-2/+8
| | | | | | | | | | | | | | | The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed by memmove-vec-unaligned-erms. The lower-bound is needed because memmove-vec-unaligned-erms unrolls the loop aggressively in the L(large_memcpy_4x) case. The upper-bound is needed because memmove-vec-unaligned-erms left-shifts the value of `x86_non_temporal_threshold` by LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow. The lack of lower-bound can be a correctness issue. The lack of upper-bound cannot.
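The two bounds can be checked with a few lines of C. This is a minimal sketch, not the tunable-handling code itself: the names `NT_THRESHOLD_MIN`, `NT_THRESHOLD_MAX`, `nt_threshold_in_bounds` and `scaled_4x_threshold` are hypothetical, while the values 16448, SIZE_MAX / 16 and LOG_4X_MEMCPY_THRESH = 4 come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_4X_MEMCPY_THRESH 4                              /* from the message */
#define NT_THRESHOLD_MIN ((size_t) 0x4040)                  /* 16448 */
#define NT_THRESHOLD_MAX (SIZE_MAX >> LOG_4X_MEMCPY_THRESH) /* SIZE_MAX / 16 */

/* Accept a user-supplied threshold only inside the assumed range.  */
static bool
nt_threshold_in_bounds (size_t value)
{
  return value >= NT_THRESHOLD_MIN && value <= NT_THRESHOLD_MAX;
}

/* Scale an accepted threshold by 16.  Because value <= SIZE_MAX / 16,
   the result cannot wrap around.  */
static size_t
scaled_4x_threshold (size_t value)
{
  return value << LOG_4X_MEMCPY_THRESH;
}
```

For the largest accepted value the scaled result is SIZE_MAX - 15, which shows why the upper bound is exactly SIZE_MAX / 16.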
* Remove remnant reference to ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATAFangrui Song2022-06-152-6/+2
| | | | This fixes nios2 build after commit de38b2a343e6d64b95c50004943d6107a9e380d0.
* elf: Remove ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATAFangrui Song2022-06-157-124/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If an executable has copy relocations for extern protected data, that can only work if the library containing the definition is built with assumptions (a) the compiler emits GOT-generating relocations (b) the linker produces R_*_GLOB_DAT instead of R_*_RELATIVE. Otherwise the library uses its own definition directly and the executable accesses a stale copy. Note: the GOT relocations defeat the purpose of protected visibility as an optimization, but allow rtld to make the executable and library use the same copy when copy relocations are present, but it turns out this never worked perfectly. ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both a.so and b.so define protected var and the executable copy relocates var: b.so accesses its own copy even with GLOB_DAT. The behavior change is from commit 62da1e3b00b51383ffa7efc89d8addda0502e107 (x86) and then copied to nios2 (ae5eae7cfc9c4a8297ff82ec6b794faca1976ecc) and arc (0e7d930c4c11de896fe807f67fa1eb756c9c1e05). Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy relocated data like a.so. There is now a warning for copy relocation on protected symbol since commit 7374c02b683b7110b853a32496a619410364d70b. It's extremely unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA behavior, so let's remove it: this removes a check in the symbol lookup code.
* x86: Add sse42 implementation to strcmp's ifuncNoah Goldstein2022-06-141-0/+5
| | | | | | | | | This has been missing since the ifuncs were added. The performance of SSE4.2 is preferable to SSE2. Measured on Tigerlake with N = 20 runs. Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
* x86: Fix misordered logic for setting `rep_movsb_stop_threshold`Noah Goldstein2022-06-141-12/+12
| | | | | | | | Move the setting of `rep_movsb_stop_threshold` to after the tunables have been collected so that the `rep_movsb_stop_threshold` (which is used to redirect control flow to the non_temporal case) will use any user value for `non_temporal_threshold` (set using glibc.cpu.x86_non_temporal_threshold)
* elf: Refine direct extern access diagnostics to protected symbolFangrui Song2022-06-141-23/+27
| | | | | | | | | | | | | | | | | | | | | | Refine commit 349b0441dab375099b1d7f6909c1742286a67da9: 1. Copy relocations for extern protected data do not work properly, regardless of whether GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS is used. It makes sense to produce a warning unconditionally. 2. Non-zero value of an undefined function symbol may break pointer equality, but may be benign in many cases (many programs don't take the address in the shared object then compare it with the address in the executable). Reword the diagnostic to be clearer. 3. Remove the unneeded condition !(undef_map->l_1_needed & GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS). If the executable does not have GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS (can only occur in error cases), the diagnostic should be emitted as well. When the defining shared object has GNU_PROPERTY_1_NEEDED_INDIRECT_EXTERN_ACCESS, report an error to apply the intended enforcement.
* Avoid -Wstringop-overflow= warning in iconv module.Stefan Liebler2022-06-141-2/+8
On s390x when compiling with GCC 12, I get this warning:

    utf8-utf16-z9.c:
    ../iconv/loop.c: In function ‘__from_utf8_loop_etf3eh_single’:
    ../iconv/loop.c:445:22: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
    445 | bytebuf[inlen++] = *inptr++;
        | ~~~~~~~~~~~~~~~~~^~~~~~~~~~
    ../iconv/loop.c:381:17: note: at offset 4 into destination object ‘bytebuf’ of size 4
    381 | unsigned char bytebuf[MAX_NEEDED_INPUT];
        |               ^~~~~~~
    ../iconv/loop.c:445:22: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
    445 | bytebuf[inlen++] = *inptr++;
        | ~~~~~~~~~~~~~~~~~^~~~~~~~~~
    ../iconv/loop.c:381:17: note: at offset 5 into destination object ‘bytebuf’ of size 4
    381 | unsigned char bytebuf[MAX_NEEDED_INPUT];
        |               ^~~~~~~

This patch tells the compiler that inptr is always before inend (inptr < inend), which avoids the warning. Note that the SINGLE function is only used to implement the mb*towc*() or wc*tomb*() functions. Those functions use inptr and inend pointing to a variable on stack, compute the inend pointer or explicitly check the arguments, which always leads to inptr < inend.

Special notes for backporters (according to Siddhesh Poyarekar): If someone wants to backport this patch to release branches, they should also backport the following wcrtomb change. Otherwise the assumptions made by this patch do not hold.

    commit 9bcd12d223a8990254b65e2dada54faa5d2742f3
    Author: Siddhesh Poyarekar <siddhesh@sourceware.org>
    Date: Fri May 13 19:10:15 2022 +0530

        wcrtomb: Make behavior POSIX compliant

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
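One common way to pass such an invariant to GCC is `__builtin_unreachable`. The sketch below is an assumption about the general technique, not a copy of the patch: `collect_bytes` is a hypothetical, simplified stand-in for the byte-collecting loop in iconv/loop.c, and it adds an explicit length check that the real loop may not have.

```c
#include <stddef.h>

#define MAX_NEEDED_INPUT 4

/* Hypothetical helper: append input bytes to bytebuf and return the new
   number of bytes stored.  */
static size_t
collect_bytes (unsigned char bytebuf[MAX_NEEDED_INPUT], size_t inlen,
               const unsigned char *inptr, const unsigned char *inend)
{
  /* Callers guarantee inptr < inend.  Stating it lets the optimizer
     discard paths that would otherwise look like out-of-bounds writes.  */
  if (!(inptr < inend))
    __builtin_unreachable ();

  while (inlen < MAX_NEEDED_INPUT && inptr < inend)
    bytebuf[inlen++] = *inptr++;

  return inlen;
}
```

The hint is only safe if every caller really upholds the invariant, which is why the backport note about the wcrtomb change matters.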
* Add bounds check to __libc_ifunc_impl_listWilco Dijkstra2022-06-109-50/+20
| | | | | | | | | | | | Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
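The idea of a bounds-checked append can be sketched as follows. This is an illustrative stand-in only: `struct impl_entry` and `impl_list_add` are hypothetical and much simpler than the real IFUNC_IMPL_ADD macro and its data structures.

```c
#include <stddef.h>

/* Hypothetical, reduced version of an ifunc implementation list entry.  */
struct impl_entry
{
  const char *name;
  int usable;
};

/* Append an entry only while there is room and return the new count.
   Once i reaches max, further additions are silently dropped instead of
   writing past the end of the array.  */
static size_t
impl_list_add (struct impl_entry *array, size_t i, size_t max,
               const char *name, int usable)
{
  if (i < max)
    {
      array[i].name = name;
      array[i].usable = usable;
      i++;
    }
  return i;
}
```

Checking against the caller-supplied maximum at every append is what makes a separate compile-time cap unnecessary in this sketch.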
* libio: Avoid RMW of flags2 outside lock (BZ #27842)Wilco Dijkstra2022-06-101-1/+0
| | | | | | | | Remove an unconditional RMW on flags2 in flockfile - we don't need to change _IO_FLAGS2_NEED_LOCK since it isn't used in flockfile or funlockfile. This fixes BZ #27842. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86: Optimize svml_s_tanhf4_core_sse4.SNoah Goldstein2022-06-091-727/+138
| | | | | | | | | | | | | | | Optimizations are: 1. Reduce code size (-112 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata size (-4k+ rodata is shared with avx2). Result is roughly a 15-16% speedup: Function, New Time, Old Time, New / Old _ZGVbN4v_tanhf, 3.158, 3.749, 0.842
* x86: Optimize svml_s_tanhf8_core_avx2.SNoah Goldstein2022-06-091-741/+171
| | | | | | | | | | | | | | | Optimizations are: 1. Reduce code size (-81 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata size (-32 bytes). Result is roughly a 17-18% speedup: Function, New Time, Old Time, New / Old _ZGVdN8v_tanhf, 1.977, 2.402, 0.823
* x86: Add data file that can be shared by tanhf-avx2 and tanhf-sse4Noah Goldstein2022-06-091-0/+621
| | | | | | | | | | tanhf-avx2 and tanhf-sse4 use the same data tables so we can save over 4kb using a shared datatable. This does increase the memory footprint of the sse4 version (as now all the targets are 32 bytes instead of 16), but generally it seems worth the code size saving. NB: This patch doesn't do anything itself; it is setup for future patches.
* x86: Optimize svml_s_tanhf16_core_avx512.SNoah Goldstein2022-06-091-240/+287
| | | | | | | | | | | | | | Optimizations are: 1. Reduce code size (-67 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Reduce rodata usage (-448 bytes). Result is roughly a 14% speedup: Function, New Time, Old Time, New / Old _ZGVeN16v_tanhf, 0.649, 0.752, 0.863
* x86: Improve svml_s_atanhf4_core_sse4.SNoah Goldstein2022-06-091-209/+169
| | | | | | | | | | | | | | | | Improvements are: 1. Reduce code size (-62 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Reduce rodata usage (-16 bytes). The throughput improvement is not significant as the port 0 bottleneck is unavoidable. Function, New Time, Old Time, New / Old _ZGVbN4v_atanhf, 8.821, 8.903, 0.991
* x86: Improve svml_s_atanhf8_core_avx2.SNoah Goldstein2022-06-091-203/+202
| | | | | | | | | | | | | | | | Improvements are: 1. Reduce code size (-60 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Prefer registers which get short instruction encoding. 5. Shrink rodata usage (-32 bytes). The throughput improvement is not that significant (3-5%) as the port 0 bottleneck is unavoidable. Function, New Time, Old Time, New / Old _ZGVdN8v_atanhf, 2.799, 2.923, 0.958
* x86: Improve svml_s_atanhf16_core_avx512.SNoah Goldstein2022-06-091-230/+244
| | | | | | | | | | | | | | | Improvements are: 1. Reduce code size (-64 bytes). 2. Remove redundant move instructions. 3. Slightly improve instruction selection/scheduling where possible. 4. Reduce rodata size ([-128, -188] bytes). The throughput improvement is not significant as the port 0 bottleneck is unavoidable. Function, New Time, Old Time, New / Old _ZGVeN16v_atanhf, 1.39, 1.408, 0.987
* x86: Align varshift table to 32-bytesNoah Goldstein2022-06-092-3/+5
| | | | This ensures the load will never split a cache line.
* x86: Add copyright to strpbrk-c.cNoah Goldstein2022-06-091-0/+18
* nss: handle stat failure in check_reload_and_get (BZ #28752)Sam James2022-06-081-15/+24
| | | | | | | | | | | | | | | | | Skip the chroot test if the database isn't loaded correctly (because the chroot test uses some existing DB state). The __stat64_time64 -> fstatat call can fail if running under an (aggressive) seccomp filter, like Firefox seems to use. This manifested in a crash when using glib built with FAM support with such a Firefox build. Suggested-by: DJ Delorie <dj@redhat.com> Signed-off-by: Sam James <sam@gentoo.org> Reviewed-by: DJ Delorie <dj@redhat.com>
* nss: add assert to DB_LOOKUP_FCT (BZ #28752)Sam James2022-06-081-0/+5
| | | | | | | | | It's interesting if we have a null action list, so an assert is worthwhile. Suggested-by: DJ Delorie <dj@redhat.com> Signed-off-by: Sam James <sam@gentoo.org> Reviewed-by: DJ Delorie <dj@redhat.com>
* x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]Noah Goldstein2022-06-082-9/+64
| | | | | | | | | | | | | | | | | | | | | | | commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Mon Jun 6 21:11:33 2022 -0700 x86: Shrink code size of memchr-avx2.S Changed how the page cross case aligned string (rdi) in rawmemchr. This was incompatible with how `L(cross_page_continue)` expected the pointer to be aligned and would cause rawmemchr to read data that started before the beginning of the string. What it would read was in valid memory but could count CHAR matches resulting in an incorrect return value. This commit fixes that issue by essentially reverting the changes to the L(page_cross) case as they didn't really matter. Test cases added and all pass with the new code (and were confirmed to fail with the old code). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* nptl_db: disable DT_RELR on libthread_db.soPaul E. Murphy2022-06-081-0/+6
| | | | | | | | | | | | | | | | | | | | Some nptl tests inadvertently use the host's gdb to verify libthread_db.so, which is loaded with the host's runtime. This causes a couple of test failures when the host glibc does not support DT_RELR. The simple, though not strictly correct, workaround is to build without DT_RELR, as this library is otherwise likely to load on glibc 2.17 and newer today. This allows tst-pthread-gdb-attach{,-static} to continue working when testing on a gdb loaded with an older glibc. This avoids a failure in tst-pthread-gdb-attach similar to: Trying host libthread_db library: .../build/glibc/nptl_db/libthread_db.so.1. dlopen failed: /lib64/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by .../build/glibc/nptl_db/libthread_db.so.1). Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* elf: add missing newlines in lateglobal testAndreas Schwab2022-06-081-3/+3
* nptl: Fix __libc_cleanup_pop_restore asynchronous restore (BZ#29214)Adhemerval Zanella2022-06-083-1/+85
| | | | | | This was due to a wrong revert done on 404656009b459658. Checked on x86_64-linux-gnu.
* x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactionsNoah Goldstein2022-06-071-3/+3
| | | | | | | | Give fall-through path to `vzeroupper` and taken-path to `vzeroall`. Generally even on machines with RTM the expectation is the string-library functions will not be called in transactions. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Shrink code size of memchr-evex.SNoah Goldstein2022-06-071-21/+25
| | | | | | | | | | | | | This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 64 bytes There are no non-negligible changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 1.000 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Shrink code size of memchr-avx2.SNoah Goldstein2022-06-072-50/+60
| | | | | | | | | | | | | This is not meant as a performance optimization. The previous code was far too liberal in aligning targets and wasted code size unnecessarily. The total code size saving is: 59 bytes There are no major changes in the benchmarks. Geometric Mean of all benchmarks New / Old: 0.967 Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memrchr-avx2.SNoah Goldstein2022-06-072-278/+257
| | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller user-arg lengths more; 2. optimizes target placement more carefully; 3. reuses logic more; 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns, which saves either a branch or multiple instructions. The total code size saving is: 306 bytes Geometric Mean of all benchmarks New / Old: 0.760 Regressions: There are some regressions, particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 10-20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input), which in turn see roughly a 15-45% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
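The use of a leading-zero count to locate the last match can be sketched in portable C. This illustrates the general idea only, not the assembly in memrchr-avx2.S: `match_mask` and `last_match_index` are hypothetical helpers, the mask is built with a scalar loop in place of a vector compare, and the GCC/Clang builtin `__builtin_clz` stands in for the `lzcnt` instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_SIZE 32

/* Build a bitmask with bit i set when block[i] == c, which is what a
   32-byte vector compare followed by a movemask would produce.  */
static uint32_t
match_mask (const unsigned char *block, unsigned char c)
{
  uint32_t mask = 0;
  for (unsigned int i = 0; i < VEC_SIZE; i++)
    if (block[i] == c)
      mask |= (uint32_t) 1 << i;
  return mask;
}

/* Return the index of the last byte equal to c in a VEC_SIZE block, or -1
   if there is none.  The highest set bit is found with a leading-zero
   count; __builtin_clz is undefined for 0, so that case is tested first
   (the lzcnt instruction itself returns 32 for a zero input).  */
static int
last_match_index (const unsigned char *block, unsigned char c)
{
  uint32_t mask = match_mask (block, c);
  if (mask == 0)
    return -1;
  return (VEC_SIZE - 1) - __builtin_clz (mask);
}
```

Because memrchr wants the last match, the highest set bit of the mask is the answer, and a single leading-zero count replaces a bit-scan loop.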
* x86: Optimize memrchr-evex.SNoah Goldstein2022-06-071-271/+268
| | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller user-arg lengths more; 2. optimizes target placement more carefully; 3. reuses logic more; 4. fixes up various inefficiencies in the logic. The biggest case here is the `lzcnt` logic for checking returns, which saves either a branch or multiple instructions. The total code size saving is: 263 bytes Geometric Mean of all benchmarks New / Old: 0.755 Regressions: There are some regressions, particularly where the length (user arg length) is large but the position of the match char is near the beginning of the string (in first VEC). This case has roughly a 20% regression. This is because the new logic gives the hot path for immediate matches to shorter lengths (the more common input), which in turn see roughly a 35% speedup. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memrchr-sse2.SNoah Goldstein2022-06-071-321/+292
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new code: 1. prioritizes smaller lengths more. 2. optimizes target placement more carefully. 3. reuses logic more. 4. fixes up various inefficiencies in the logic. The total code size saving is: 394 bytes Geometric Mean of all benchmarks New / Old: 0.874 Regressions: 1. The page cross case is now colder, especially re-entry from the page cross case if a match is not found in the first VEC (roughly 50%). My general opinion with this patch is this is acceptable given the "coldness" of this case (less than 4%) and the general performance improvement in the other far more common cases. 2. There are some regressions of 5-15% for medium/large user-arg lengths that have a match in the first VEC. This is because the logic was rewritten to optimize finds in the first VEC if the user-arg length is shorter (where we see roughly 20-50% performance improvements). It is not always the case this is a regression. My intuition is some frontend quirk is partially explaining the data although I haven't been able to find the root cause. Full xcheck passes on x86_64. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* Benchtests: Improve memrchr benchmarksNoah Goldstein2022-06-071-45/+65
| | | | | | | | | | Add a second iteration for memrchr to set `pos` starting from the end of the buffer. Previously `pos` was only set relative to the beginning of the buffer. This isn't really useful for memrchr because the beginning of the search space is (buf + len). Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret`Noah Goldstein2022-06-072-0/+19
The RTM vzeroupper mitigation has no way of replacing an inline vzeroupper that is not directly before a return. Such a replacement can be useful when hoisting a vzeroupper to save code size, for example:

```
L(foo):
    cmpl %eax, %edx
    jz L(bar)
    tzcntl %eax, %eax
    addq %rdi, %rax
    VZEROUPPER_RETURN
L(bar):
    xorl %eax, %eax
    VZEROUPPER_RETURN
```

Can become:

```
L(foo):
    COND_VZEROUPPER
    cmpl %eax, %edx
    jz L(bar)
    tzcntl %eax, %eax
    addq %rdi, %rax
    ret
L(bar):
    xorl %eax, %eax
    ret
```

This code does not change any existing functionality. There is no difference in the objdump of libc.so before and after this patch.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Create header for VEC classes in x86 strings libraryNoah Goldstein2022-06-077-0/+327
| | | | | | | | | | This patch does not touch any existing code and is only meant to be a tool for future patches so that simple source files can more easily be maintained to target multiple VEC classes. There is no difference in the objdump of libc.so before and after this patch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197]Matheus Castanho2022-06-071-2/+2
| | | | | | | | | | | | | | | __strncpy_power9 initializes VR 18 with zeroes to be used throughout the code, including when zero-padding the destination string. However, the v18 reference was mistakenly being used for stxv and stxvl, which take a VSX vector as operand. The code ended up using the uninitialized VSR 18 register by mistake. Both occurrences have been changed to use the proper VSX number for VR 18 (i.e. VSR 50). Tested on powerpc, powerpc64 and powerpc64le. Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
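The register numbering behind this fix is a fixed offset: on processors with VSX, the 32 vector registers VR0 to VR31 alias VSR32 to VSR63. A one-line sketch (the helper name `vr_to_vsr` is hypothetical) makes the arithmetic explicit:

```c
/* Map a vector register number (VR0..VR31) to the VSX register number
   that aliases it (VSR32..VSR63).  */
static unsigned int
vr_to_vsr (unsigned int vr)
{
  return vr + 32;
}
```

So an instruction that takes a VSX operand, such as stxv or stxvl, has to name register 50 in order to store the contents of VR 18.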