about summary refs log tree commit diff
path: root/sysdeps
Commit message (Collapse)AuthorAgeFilesLines
* x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNIH.J. Lu2021-12-061-2/+5
| | | | | | Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since they won't lower CPU frequency when ZMM load and store instructions are used.
* linux: Add generic ioctl implementationAdhemerval Zanella2021-12-063-31/+85
| | | | The powerpc is refactor to use the default implementation.
* linux: Add generic syscall implementationAdhemerval Zanella2021-12-064-67/+65
| | | | | It allows also to remove hppa specific implementation and simplify riscv implementation a bit.
* csu: Always use __executable_start in gmon-start.cFlorian Weimer2021-12-053-63/+0
| | | | | | | | | | | | | | Current binutils defines __executable_start as the lowest text address, so using the entry point address as a fallback is no longer necessary. As a result, overriding <entry.h> is only necessary if the entry point is not called _start. The previous approach to define __ASSEMBLY__ to suppress the declaration breaks if headers included by <entry.h> are not compatible with __ASSEMBLY__. This happens with rseq integration because it is necessary to include kernel headers in more places. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* elf: execve statically linked programs instead of crashing [BZ #28648]Florian Weimer2021-12-052-0/+50
| | | | | | | | | | | | | | | Programs without dynamic dependencies and without a program interpreter are now run via execve. Previously, the dynamic linker either crashed while attempting to read a non-existing dynamic segment (looking for DT_AUDIT/DT_DEPAUDIT data), or the self-relocated in the static PIE executable crashed because the outer dynamic linker had already applied RELRO protection. <dl-execve.h> is needed because execve is not available in the dynamic loader on Hurd. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Use notl in EVEX strcmp [BZ #28646]Noah Goldstein2021-12-031-6/+8
| | | | | | | | Must use notl %edi here as lower bits are for CHAR comparisons potentially out of range thus can be 0 without indicating mismatch. This fixes BZ #28646. Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
* nptl: Increase default TCB alignment to 32Florian Weimer2021-12-0316-49/+0
| | | | | | | | | | rseq support will use a 32-byte aligned field in struct pthread, so the whole struct needs to have at least that alignment. nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h> to obtain the fallback definition. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* AArch64: Improve A64FX memcpyWilco Dijkstra2021-12-021-321/+225
| | | | | | | | | | | | | v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes. Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* AArch64: Optimize memcmpWilco Dijkstra2021-12-021-107/+134
| | | | | | | | Rewrite memcmp to improve performance. On small and medium inputs performance is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per iteration, which is 30-50% faster depending on the size. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* powerpc64[le]: Fix CFI and LR save address for asm syscalls [BZ #28532]Matheus Castanho2021-11-301-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syscalls based on the assembly templates are missing CFI for r31, which gets clobbered when scv is used, and info for LR is inaccurate, placed in the wrong LOC and not using the proper offset. LR was also being saved to the callee's frame, while the ABI mandates it to be saved to the caller's frame. These are fixed by this commit. After this change: $ readelf -wF libc.so.6 | grep 0004b9d4.. -A 7 && objdump --disassemble=kill libc.so.6 00004a48 0000000000000020 00004a4c FDE cie=00000000 pc=000000000004b9d4..000000000004ba3c LOC CFA r31 ra 000000000004b9d4 r1+0 u u 000000000004b9e4 r1+48 u u 000000000004b9e8 r1+48 c-16 u 000000000004b9fc r1+48 c-16 c+16 000000000004ba08 r1+48 c-16 000000000004ba18 r1+48 u 000000000004ba1c r1+0 u libc.so.6: file format elf64-powerpcle Disassembly of section .text: 000000000004b9d4 <kill>: 4b9d4: 1f 00 4c 3c addis r2,r12,31 4b9d8: 2c c3 42 38 addi r2,r2,-15572 4b9dc: 25 00 00 38 li r0,37 4b9e0: d1 ff 21 f8 stdu r1,-48(r1) 4b9e4: 20 00 e1 fb std r31,32(r1) 4b9e8: 98 8f ed eb ld r31,-28776(r13) 4b9ec: 10 00 ff 77 andis. r31,r31,16 4b9f0: 1c 00 82 41 beq 4ba0c <kill+0x38> 4b9f4: a6 02 28 7d mflr r9 4b9f8: 40 00 21 f9 std r9,64(r1) 4b9fc: 01 00 00 44 scv 0 4ba00: 40 00 21 e9 ld r9,64(r1) 4ba04: a6 03 28 7d mtlr r9 4ba08: 08 00 00 48 b 4ba10 <kill+0x3c> 4ba0c: 02 00 00 44 sc 4ba10: 00 00 bf 2e cmpdi cr5,r31,0 4ba14: 20 00 e1 eb ld r31,32(r1) 4ba18: 30 00 21 38 addi r1,r1,48 4ba1c: 18 00 96 41 beq cr5,4ba34 <kill+0x60> 4ba20: 01 f0 20 39 li r9,-4095 4ba24: 40 48 23 7c cmpld r3,r9 4ba28: 20 00 e0 4d bltlr+ 4ba2c: d0 00 63 7c neg r3,r3 4ba30: 08 00 00 48 b 4ba38 <kill+0x64> 4ba34: 20 00 e3 4c bnslr+ 4ba38: c8 32 fe 4b b 2ed00 <__syscall_error> ... 4ba44: 40 20 0c 00 .long 0xc2040 4ba48: 68 00 00 00 .long 0x68 4ba4c: 06 00 5f 5f rlwnm r31,r26,r0,0,3 4ba50: 6b 69 6c 6c xoris r12,r3,26987
* linux: Implement pipe in terms of __NR_pipe2Adhemerval Zanella2021-11-3010-221/+3
| | | | | | | | | The syscall pipe2 was added in linux 2.6.27 and glibc requires linux 3.2.0. The patch removes the arch-specific implementation for alpha, ia64, mips, sh, and sparc which requires a different kernel ABI than the usual one. Checked on x86_64-linux-gnu and with a build for the affected ABIs.
* linux: Implement mremap in CAdhemerval Zanella2021-11-303-1/+42
| | | | | | | | | Variadic function calls in syscalls.list does not work for all ABIs (for instance where the argument are passed on the stack instead of registers) and might have underlying issues depending of the variadic type (for instance if a 64-bit argument is used). Checked on x86_64-linux-gnu.
* linux: Add prlimit64 C implementationAdhemerval Zanella2021-11-3018-27/+47
| | | | | | | | | | | The LFS prlimit64 requires a arch-specific implementation in syscalls.list. Instead add a generic one that handles the required symbol alias for __RLIM_T_MATCHES_RLIM64_T. HPPA is the only outlier which requires a different default symbol. Checked on x86_64-linux-gnu and with build for the affected ABIs.
* linux: Use /proc/stat fallback for __get_nprocs_conf (BZ #28624)Adhemerval Zanella2021-11-251-25/+35
| | | | | | | The /proc/statm fallback was removed by f13fb81ad3159 if sysfs is not available, reinstate it. Checked on x86_64-linux-gnu.
* linux: Add fanotify_mark C implementationAdhemerval Zanella2021-11-2518-22/+42
| | | | | | | | | Passing 64-bit arguments on syscalls.list is tricky: it requires to reimplement the expected kernel abi in each architecture. This is way to better to represent in C code where we already have macros for this (SYSCALL_LL64). Checked on x86_64-linux-gnu.
* linux: Only build fstatat fallback if requiredAdhemerval Zanella2021-11-251-7/+11
| | | | | | | For 32-bit architecture with __ASSUME_STATX there is no need to build fstatat64_time64_stat. Checked on i686-linux-gnu.
* x86-64: Add vector sin/sinf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector sin/sinf and input files to libmvec microbenchmark. libmvec-sin-inputs: 90% Normal random distribution range: (-DBL_MAX, DBL_MAX) mean: 0.0 sigma: 5.0 10% uniform random distribution in range (-1000.0, 1000.0) libmvec-sinf-inputs: 90% Normal random distribution range: (-FLT_MAX, FLT_MAX) mean: 0.0f sigma: 5.0f 10% uniform random distribution in range (-1000.0f, 1000.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector pow/powf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add vector pow/powf and input files to libmvec microbenchmark. libmvec-pow-inputs: arg1: 90% Normal random distribution range: (0.0, 256.0) mean: 0.0 sigma: 32.0 10% uniform random distribution in range (0.0, 256.0) arg2: 90% Normal random distribution range: (-127.0, 127.0) mean: 0.0 sigma: 16.0 10% uniform random distribution in range (-127.0, 127.0) libmvec-powf-inputs: arg1: 90% Normal random distribution range: (0.0f, 100.0f) mean: 0.0f sigma: 16.0f 10% uniform random distribution in range (0.0f, 100.0f) arg2: 90% Normal random distribution range: (-10.0f, 10.0f) mean: 0.0f sigma: 8.0f 10% uniform random distribution in range (-10.0f, 10.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector log/logf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector log/logf and input files to libmvec microbenchmark. libmvec-log-inputs: 70% Normal random distribution range: (0.0, DBL_MAX) mean: 1.0 sigma: 50.0 30% uniform random distribution in range (0.0, DBL_MAX) libmvec-logf-inputs: 70% Normal random distribution range: (0.0f, FLT_MAX) mean: 1.0f sigma: 50.0f 30% uniform random distribution in range (0.0f, FLT_MAX) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector exp/expf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector exp/expf and input files to libmvec microbenchmark. libmvec-exp-inputs: 90% Normal random distribution range: (-708.0, 709.0) mean: 0.0 sigma: 16.0 10% uniform random distribution in range (-500.0, 500.0) libmvec-expf-inputs: 90% Normal random distribution range: (-87.0f, 88.0f) mean: 0.0f sigma: 8.0f 10% uniform random distribution in range (-50.0f, 50.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86-64: Add vector cos/cosf to libmvec microbenchmarkSunil K Pandey2021-11-243-0/+8201
| | | | | | | | | | | | | | | | | | | | Add vector cos/cosf and input files to libmvec microbenchmark. libmvec-cos-inputs: 90% Normal random distribution range: (-DBL_MAX, DBL_MAX) mean: 0.0 sigma: 5.0 10% uniform random distribution in range (-1000.0, 1000.0) libmvec-cosf-inputs: 90% Normal random distribution range: (-FLT_MAX, FLT_MAX) mean: 0.0f sigma: 5.0f 10% uniform random distribution in range (-1000.0f, 1000.0f) Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* io: Refactor close_range and closefromAdhemerval Zanella2021-11-2411-375/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that Hurd implementis both close_range and closefrom (f2c996597d), we can make close_range() a base ABI, and make the default closefrom() implementation on top of close_range(). The generic closefrom() implementation based on __getdtablesize() is moved to generic close_range(). On Linux it will be overriden by the auto-generation syscall while on Hurd it will be a system specific implementation. The closefrom() now calls close_range() and __closefrom_fallback(). Since on Hurd close_range() does not fail, __closefrom_fallback() is an empty static inline function set by__ASSUME_CLOSE_RANGE. The __ASSUME_CLOSE_RANGE also allows optimize Linux __closefrom_fallback() implementation when --enable-kernel=5.9 or higher is used. Finally the Linux specific tst-close_range.c is moved to io and enabled as default. The Linuxism and CLOSE_RANGE_UNSHARE are guarded so it can be built for Hurd (I have not actually test it). Checked on x86_64-linux-gnu, i686-linux-gnu, and with a i686-gnu build.
* nptl: Do not set signal mask on second setjmp return [BZ #28607]Florian Weimer2021-11-242-0/+46
| | | | | | | | | | | | __libc_signal_restore_set was in the wrong place: It also ran when setjmp returned the second time (after pthread_exit or pthread_cancel). This is observable with blocked pending signals during thread exit. Fixes commit b3cae39dcbfa2432b3f3aa28854d8ac57f0de1b8 ("nptl: Start new threads with all signals blocked [BZ #25098]"). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: Define USE_PPC64_NOTOC iff compiler supports itAdhemerval Zanella2021-11-222-24/+43
| | | | | | | | | | | | | | | | | | The @notoc usage only yields an advantage on ISA 3.1+ machine (power10) and for ld.bfd also when it sees pcrel relocations used on the code (generated if compiler targets ISA 3.1+). On bfd case ISA 3.1+ instruction on stubs are used iff linker also sees the new pc-relative relocations (for instance R_PPC64_D34), otherwise it generates default stubs (ppc64_elf_check_relocs:4700). This patch also help on linkers that do not implement this optimization, since building for older ISA (such as 3.0 / power9) will also trigger power10 stubs generation in the assembly code uses the NOTOC imacro. Checked on powerpc64le-linux-gnu. Reviewed-by: Fangrui Song <maskray@google.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* setjmp: Replace jmp_buf-macros.h with jmp_buf-macros.symAdhemerval Zanella2021-11-2229-264/+1
| | | | | | | | | | | | It requires less boilerplate code for newer ports. The _Static_assert checks from internal setjmp are moved to its own internal test since setjmp.h is included early by multiple headers (to generate rtld-sizes.sym). The riscv jmp_buf-macros.h check is also redundant, it is already done by riscv configure.ac. Checked with a build for the affected architectures.
* Update kernel version to 5.15 in tst-mman-consts.pyJoseph Myers2021-11-221-1/+1
| | | | | | | | This patch updates the kernel version in the test tst-mman-consts.py to 5.15. (There are no new MAP_* constants covered by this test in 5.15 that need any other header changes.) Tested with build-many-glibcs.py.
* Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.hJoseph Myers2021-11-171-1/+3
| | | | | | | Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add these constants to bits/socket.h. Tested for x86_64.
* elf: Introduce GLRO (dl_libc_freeres), called from __libc_freeresFlorian Weimer2021-11-171-0/+7
| | | | | | | This will be used to deallocate memory allocated using the non-minimal malloc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* nptl: Extract <bits/atomic_wide_counter.h> from pthread_cond_common.cFlorian Weimer2021-11-171-18/+4
| | | | | | | | | | | | | And make it an installed header. This addresses a few aliasing violations (which do not seem to result in miscompilation due to the use of atomics), and also enables use of wide counters in other parts of the library. The debug output in nptl/tst-cond22 has been adjusted to print the 32-bit values instead because it avoids a big-endian/little-endian difference. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86-64: Create microbenchmark infrastructure for libmvecSunil K Pandey2021-11-164-0/+642
| | | | | | | | | | | | | Add python script to generate libmvec microbenchmark from the input values for each libmvec function using skeleton benchmark template. Creates double and float benchmarks with vector length 1, 2, 4, 8, and 16 for each libmvec function. Vector length 1 corresponds to scalar version of function and is included for vector function perf comparison. Co-authored-by: Haochen Jiang <haochen.jiang@intel.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Shrink memcmp-sse4.S code sizeNoah Goldstein2021-11-101-1621/+646
| | | | | | | | | | | | | | | | | | | | | | | | No bug. This implementation refactors memcmp-sse4.S primarily with minimizing code size in mind. It does this by removing the lookup table logic and removing the unrolled check from (256, 512] bytes. memcmp-sse4 code size reduction : -3487 bytes wmemcmp-sse4 code size reduction: -1472 bytes The current memcmp-sse4.S implementation has a large code size cost. This has serious adverse affects on the ICache / ITLB. While in micro-benchmarks the implementations appears fast, traces of real-world code have shown that the speed in micro benchmarks does not translate when the ICache/ITLB are not primed, and that the cost of the code size has measurable negative affects on overall application performance. See https://research.google/pubs/pub48320/ for more details. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* Update syscall lists for Linux 5.15Joseph Myers2021-11-1028-4/+31
| | | | | | | | | | | | | | Linux 5.15 has one new syscall, process_mrelease (and also enables the clone3 syscall for RV32). It also has a macro __NR_SYSCALL_MASK for Arm, which is not a syscall but matches the pattern used for syscall macro names. Add __NR_SYSCALL_MASK to the names filtered out in the code dealing with syscall lists, update syscall-names.list for the new syscall and regenerate the arch-syscall.h headers with build-many-glibcs.py update-syscalls. Tested with build-many-glibcs.py.
* s390: Use long branches across object boundaries (jgh instead of jh)Florian Weimer2021-11-102-2/+2
| | | | | | | | | | Depending on the layout chosen by the linker, the 16-bit displacement of the jh instruction is insufficient to reach the target label. Analysis of the linker failure was carried out by Nick Clifton. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
* Remove the unused +mkdep/+make-deps/s-proto.S/s-proto-cancel.SH.J. Lu2021-11-103-13/+0
| | | | | | | | | | | | | | | Since commit d73f5331ce5370ca5a879229e3842f5de98689cd Author: Roland McGrath <roland@gnu.org> Date: Fri May 2 02:20:45 2003 +0000 2003-05-01 Roland McGrath <roland@redhat.com> dependency is generated by passing -MD -MF to compiler. Remove the unused +mkdep, +make-deps, s-proto.S and s-proto-cancel.S. This fixes BZ #28554.
* Fix build a chec failures after b05fae4d8e34Adhemerval Zanella2021-11-091-1/+0
| | | | | | | | | | The include cleanup on dl-minimal.c removed too much for some targets. Also for Hurd, __sbrk is removed from localplt.data now that tunables allocated memory through mmap. Checked with a build for all affected architectures.
* elf: Use the minimal malloc on tunables_strdupAdhemerval Zanella2021-11-091-0/+28
| | | | | | | | | | | | | | | | | | | | The rtld_malloc functions are moved to its own file so it can be used on csu code. Also, the functiosn are renamed to __minimal_* (since there are now used not only on loader code). Using the __minimal_malloc on tunables_strdup() avoids potential issues with sbrk() calls while processing the tunables (I see sporadic elf/tst-dso-ordering9 on powerpc64le with different tests failing due ASLR). Also, using __minimal_malloc over plain mmap optimizes the memory allocation on both static and dynamic case (since it will any unused space in either the last page of data segments, avoiding mmap() call, or from the previous mmap() call). Checked on x86_64-linux-gnu, i686-linux-gnu, and powerpc64le-linux-gnu. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* hurd: Remove unused __libc_close_rangeSamuel Thibault2021-11-071-1/+0
| | | | That was just cargo-culted.
* hurd: Implement close_range and closefromSergey Bugaev2021-11-076-1/+135
| | | | | | | | | | | | | | | | | | The close_range () function implements the same API as the Linux and FreeBSD syscalls. It operates atomically and reliably. The specified upper bound is clamped to the actual size of the file descriptor table; it is expected that the most common use case is with last = UINT_MAX. Like in the Linux syscall, it is also possible to pass the CLOSE_RANGE_CLOEXEC flag to mark the file descriptors in the range cloexec instead of acually closing them. Also, add a Hurd version of the closefrom () function. Since unlike on Linux, close_range () cannot fail due to being unuspported by the running kernel, a fallback implementation is never necessary. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Message-Id: <20211106153524.82700-1-bugaevc@gmail.com>
* x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.hNoah Goldstein2021-11-062-14/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No bug. This patch doubles the rep_movsb_threshold when using ERMS. Based on benchmarks the vector copy loop, especially now that it handles 4k aliasing, is better for these medium ranged. On Skylake with ERMS: Size, Align1, Align2, dst>src,(rep movsb) / (vec copy) 4096, 0, 0, 0, 0.975 4096, 0, 0, 1, 0.953 4096, 12, 0, 0, 0.969 4096, 12, 0, 1, 0.872 4096, 44, 0, 0, 0.979 4096, 44, 0, 1, 0.83 4096, 0, 12, 0, 1.006 4096, 0, 12, 1, 0.989 4096, 0, 44, 0, 0.739 4096, 0, 44, 1, 0.942 4096, 12, 12, 0, 1.009 4096, 12, 12, 1, 0.973 4096, 44, 44, 0, 0.791 4096, 44, 44, 1, 0.961 4096, 2048, 0, 0, 0.978 4096, 2048, 0, 1, 0.951 4096, 2060, 0, 0, 0.986 4096, 2060, 0, 1, 0.963 4096, 2048, 12, 0, 0.971 4096, 2048, 12, 1, 0.941 4096, 2060, 12, 0, 0.977 4096, 2060, 12, 1, 0.949 8192, 0, 0, 0, 0.85 8192, 0, 0, 1, 0.845 8192, 13, 0, 0, 0.937 8192, 13, 0, 1, 0.939 8192, 45, 0, 0, 0.932 8192, 45, 0, 1, 0.927 8192, 0, 13, 0, 0.621 8192, 0, 13, 1, 0.62 8192, 0, 45, 0, 0.53 8192, 0, 45, 1, 0.516 8192, 13, 13, 0, 0.664 8192, 13, 13, 1, 0.659 8192, 45, 45, 0, 0.593 8192, 45, 45, 1, 0.575 8192, 2048, 0, 0, 0.854 8192, 2048, 0, 1, 0.834 8192, 2061, 0, 0, 0.863 8192, 2061, 0, 1, 0.857 8192, 2048, 13, 0, 0.63 8192, 2048, 13, 1, 0.629 8192, 2061, 13, 0, 0.627 8192, 2061, 13, 1, 0.62 Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* x86: Optimize memmove-vec-unaligned-erms.SNoah Goldstein2021-11-066-224/+381
| | | | | | | | | | | | | | | | | | | | | | No bug. The optimizations are as follows: 1) Always align entry to 64 bytes. This makes behavior more predictable and makes other frontend optimizations easier. 2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have significant benefits in the case that: 0 < (dst - src) < [256, 512] 3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%] improvement and for FSRM [-10%, 25%]. In addition to these primary changes there is general cleanup throughout to optimize the aligning routines and control flow logic. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* [powerpc] Tighten contraints for asm constant parametersPaul A. Clarke2021-11-033-9/+9
| | | | | | | | | | | | There are a few places where only known numeric values are acceptable for `asm` parameters, yet the constraint "i" is used. "i" can include "symbolic constants whose values will be known only at assembly time or later." Use "n" instead of "i" where known numeric values are required. Suggested-by: Segher Boessenkool <segher@kernel.crashing.org> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* riscv: Build with -mno-relax if linker does not support R_RISCV_ALIGNAdhemerval Zanella2021-11-033-0/+52
| | | | | | | | | | | It allows build both glibc and tests with lld (Since lld does not support R_RISCV_ALIGN linker relaxation). Checked with a build for riscv32-linux-gnu-rv32imafdc-ilp32d and riscv64-linux-gnu-rv64imafdc-lp64d. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Fangrui Song <maskray@google.com>
* x86-64: Replace movzx with movzblFangrui Song2021-11-022-4/+4
| | | | | | | | | | | | | Clang cannot assemble movzx in the AT&T dialect mode. ../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction movzx (%rsi), %ecx ^~~~ Change movzx to movzbl, which follows the AT&T dialect and is used elsewhere in the file. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* i386: Explain why __HAVE_64B_ATOMICS has to be 0Florian Weimer2021-11-021-0/+4
|
* arm: Use have-mtls-dialect-gnu2 to check for ARM TLS descriptors supportAdhemerval Zanella2021-11-011-6/+1
| | | | | | | The lld linker does not support TLSDESC for arm. The have-arm-tls-desc is a leftover of 56583289b1 to support NaCL. Reviewed-by: Fangrui Song <maskray@google.com>
* arm: Use internal symbol for _dl_argv on _dl_start_userAdhemerval Zanella2021-11-011-1/+1
| | | | | | | | | | | | The lld does not support R_ARM_GOTOFF32 to preemptible symbol (_dl_argv has default visibility). Use the internal alias instead (one option would to use HIDDEN_JUMPTARGET, bu the macro is not defined for !__ASSEMBLER__ and I made this patch arm-specific to avoid require to check extensivelly on other architecture it this might break something). Checked on arm-linux-gnueabihf. Reviewed-by: Fangrui Song <maskray@google.com>
* x86-64: Remove Prefer_AVX2_STRCMPH.J. Lu2021-11-015-15/+2
| | | | | | | | | Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1 VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1 VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake.
* x86-64: Improve EVEX strcmp with masked loadH.J. Lu2021-11-011-218/+243
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In strcmp-evex.S, to compare 2 32-byte strings, replace VMOVU (%rdi, %rdx), %YMM0 VMOVU (%rsi, %rdx), %YMM1 /* Each bit in K0 represents a mismatch in YMM0 and YMM1. */ VPCMP $4, %YMM0, %YMM1, %k0 VPCMP $0, %YMMZERO, %YMM0, %k1 VPCMP $0, %YMMZERO, %YMM1, %k2 /* Each bit in K1 represents a NULL in YMM0 or YMM1. */ kord %k1, %k2, %k1 /* Each bit in K1 represents a NULL or a mismatch. */ kord %k0, %k1, %k1 kmovd %k1, %ecx testl %ecx, %ecx jne L(last_vector) with VMOVU (%rdi, %rdx), %YMM0 VPTESTM %YMM0, %YMM0, %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi, %rdx). */ VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2} kmovd %k1, %ecx incl %ecx jne L(last_vector) It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake and Ice Lake. Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
* Fix compiler issue with mmap_internalStafford Horne2021-10-291-0/+2
| | | | | | | | | | | | | | Compiling mmap_internal fails to compile when we use -1 for MMAP2_PAGE_UNIT on 32 bit architectures. The error is as follows: ../sysdeps/unix/sysv/linux/mmap_internal.h:30:8: error: unknown type name 'uint64_t' | 30 | static uint64_t page_unit; | | ^~~~~~~~ Fix by adding including stdint.h.
* x86_64: Add memcmpeq.S to fix disable-multi-arch buildNoah Goldstein2021-10-281-0/+21
| | | | | | | | | | | | | | | | The following commit: commit cf4fd28ea453d1a9cec93939bc88b58ccef5437a Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Tue Oct 26 19:43:18 2021 -0500 Broke --disable-multi-arch build for x86_64 because x86_64/memcmpeq.S was not defined outside of multiarch and the alias for __memcmpeq in x86_64/memcmp.S was removed. This commit fixes that issue by adding x86_64/memcmpeq.S. make xcheck passes on x86_64 with and without --disable-multi-arch