about summary refs log tree commit diff
path: root/sysdeps/x86
Commit message (Collapse)AuthorAgeFilesLines
* x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`Noah Goldstein2023-06-121-27/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current `non_temporal_threshold` set to roughly '3/4 * sizeof_L3 / ncores_per_socket'. This patch updates that value to roughly 'sizeof_L3 / 4` The original value (specifically dividing the `ncores_per_socket`) was done to limit the amount of other threads' data a `memcpy`/`memset` could evict. Dividing by 'ncores_per_socket', however leads to exceedingly low non-temporal thresholds and leads to using non-temporal stores in cases where REP MOVSB is multiple times faster. Furthermore, non-temporal stores are written directly to main memory so using it at a size much smaller than L3 can place soon to be accessed data much further away than it otherwise could be. As well, modern machines are able to detect streaming patterns (especially if REP MOVSB is used) and provide LRU hints to the memory subsystem. This in affect caps the total amount of eviction at 1/cache_associativity, far below meaningfully thrashing the entire cache. As best I can tell, the benchmarks that lead this small threshold where done comparing non-temporal stores versus standard cacheable stores. A better comparison (linked below) is to be REP MOVSB which, on the measure systems, is nearly 2x faster than non-temporal stores at the low-end of the previous threshold, and within 10% for over 100MB copies (well past even the current threshold). In cases with a low number of threads competing for bandwidth, REP MOVSB is ~2x faster up to `sizeof_L3`. The divisor of `4` is a somewhat arbitrary value. From benchmarks it seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs such as Broadwell prefer something closer to `8`. This patch is meant to be followed up by another one to make the divisor cpu-specific, but in the meantime (and for easier backporting), this patch settles on `4` as a middle-ground. Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable stores where done using: https://github.com/goldsteinn/memcpy-nt-benchmarks Sheets results (also available in pdf on the github): https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml Reviewed-by: DJ Delorie <dj@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Fix a few more typos I missed in previous round -- BZ 25337Paul Pluzhnikov2023-06-021-1/+1
|
* Fix misspellings in sysdeps/ -- BZ 25337Paul Pluzhnikov2023-05-303-3/+3
|
* x86: Use 64MB as nt-store threshold if no cacheinfo [BZ #30429]Noah Goldstein2023-05-271-1/+9
| | | | | | | | | | | | If `non_temporal_threshold` is below `minimum_non_temporal_threshold`, it almost certainly means we failed to read the systems cache info. In this case, rather than defaulting the minimum correct value, we should default to a value that gets at least reasonable performance. 64MB is chosen conservatively to be at the very high end. This should never cause non-temporal stores when, if we had read cache info, we wouldn't have otherwise. Reviewed-by: Florian Weimer <fweimer@redhat.com>
* <sys/platform/x86.h>: Add PREFETCHI supportH.J. Lu2023-04-054-0/+7
| | | | | Add PREFETCHI support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AMX-COMPLEX supportH.J. Lu2023-04-054-0/+8
| | | | | Add AMX-COMPLEX support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-NE-CONVERT supportH.J. Lu2023-04-054-0/+8
| | | | | Add AVX-NE-CONVERT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-VNNI-INT8 supportH.J. Lu2023-04-054-0/+17
| | | | | Add AVX-VNNI-INT8 support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add MSRLIST supportH.J. Lu2023-04-052-0/+2
| | | | | Add MSRLIST support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-IFMA supportH.J. Lu2023-04-054-0/+8
| | | | | Add AVX-IFMA support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AMX-FP16 supportH.J. Lu2023-04-054-0/+8
| | | | | Add AMX-FP16 support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add WRMSRNS supportH.J. Lu2023-04-052-0/+2
| | | | | Add WRMSRNS support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add ArchPerfmonExt supportH.J. Lu2023-04-052-0/+2
| | | | | | Add Architectural Performance Monitoring Extended Leaf (EAX = 23H) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add CMPCCXADD supportH.J. Lu2023-04-054-0/+7
| | | | | Add CMPCCXADD support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LASS supportH.J. Lu2023-04-052-0/+2
| | | | | Add Linear Address Space Separation (LASS) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add RAO-INT supportH.J. Lu2023-04-054-0/+7
| | | | | Add RAO-INT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LBR supportH.J. Lu2023-04-052-1/+2
| | | | | Add architectural LBR support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add RTM_FORCE_ABORT supportH.J. Lu2023-04-052-1/+2
| | | | | Add RTM_FORCE_ABORT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add SGX-KEYS supportH.J. Lu2023-04-052-1/+2
| | | | | Add SGX-KEYS support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add BUS_LOCK_DETECT supportH.J. Lu2023-04-052-1/+2
| | | | | | Add Bus lock debug exceptions (BUS_LOCK_DETECT) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LA57 supportH.J. Lu2023-04-052-1/+2
| | | | | | Add 57-bit linear addresses and five-level paging (LA57) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <bits/platform/x86.h>: Rename to x86_cpu_INDEX_7_ECX_15H.J. Lu2023-04-051-1/+1
| | | | | | Rename x86_cpu_INDEX_7_ECX_1 to x86_cpu_INDEX_7_ECX_15 for the unused bit 15 in ECX from CPUID with EAX == 0x7 and ECX == 0. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86/dl-cacheinfo: remove unsused parameter from handle_amdAndreas Schwab2023-04-041-36/+30
| | | | Also replace an unreachable assert with __builtin_unreachable.
* x86: Set FSGSBASE to active if enabled by kernelH.J. Lu2023-04-033-0/+31
| | | | | | | | | | | | | Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are enabled. If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE instructions can be used in user space. Define dl_check_hwcap2 to set the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is set. Add a test to verify that FSGSBASE is active on current kernels. NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo. Reviewed-by: Florian Weimer <fweimer@redhat.com>
* Remove --enable-tunables configure optionAdhemerval Zanella Netto2023-03-295-68/+29
| | | | | | | | | | | | And make always supported. The configure option was added on glibc 2.25 and some features require it (such as hwcap mask, huge pages support, and lock elisition tuning). It also simplifies the build permutations. Changes from v1: * Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs more discussion. * Cleanup more code. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* x86: Don't check PREFETCHWT1 in tst-cpu-features-cpuinfo.cDJ Delorie2023-03-211-0/+3
| | | | | | | Don't check PREFETCHWT1 against /proc/cpuinfo since kernel doesn't report PREFETCHWT1 in /proc/cpuinfo. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Fix bug about glibc.cpu.hwcaps.caiyinyu2023-03-071-3/+3
| | | | | | | | | | | | | | | | | | | Recorded in [BZ #30183]: 1. export GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX512 2. Add _dl_printf("p -- %s\n", p); just before switch(nl) in sysdeps/x86/cpu-tunables.c 3. compiled and run ./testrun.sh /usr/bin/ls you will get: p -- -AVX512 p -- LC_ADDRESS=en_US.UTF-8 p -- LC_NUMERIC=C ... The function, TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp), checks far more than it should and it should stop at end of "-AVX512".
* x86-64: Add glibc.cpu.prefer_map_32bit_exec [BZ #28656]H.J. Lu2023-02-222-0/+16
| | | | | | | | | | | | | | | | | | | Crossing 2GB boundaries with indirect calls and jumps can use more branch prediction resources on Intel Golden Cove CPU (see the "Misprediction for Branches >2GB" section in Intel 64 and IA-32 Architectures Optimization Reference Manual.) There is visible performance improvement on workloads with many PLT calls when executable and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable or denywrite pages in shared libraries with MAP_32BIT first. NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec, or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only between shared libraries or between shared libraries and executables with addresses below 2GB. PIEs are usually loaded at a random address above 4GB by the kernel.
* string: Remove string_private.hAdhemerval Zanella2023-02-171-20/+0
| | | | | | Now that _STRING_ARCH_unaligned is not used anymore. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
* htl: Generalize i386 pt-machdep.h to x86Samuel Thibault2023-02-121-0/+28
|
* x86: Cache computation for AMD architecture.Sajan Karumanchi2023-01-181-159/+45
| | | | | | | | All AMD architectures cache details will be computed based on __cpuid__ `0x8000_001D` and the reference to __cpuid__ `0x8000_0006` will be zeroed out for future architectures. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
* Update copyright dates with scripts/update-copyrightsJoseph Myers2023-01-06121-121/+121
|
* x86: Check minimum/maximum of non_temporal_threshold [BZ #29953]H.J. Lu2023-01-031-9/+16
| | | | | | | | | The minimum non_temporal_threshold is 0x4040. non_temporal_threshold may be set to less than the minimum value when the shared cache size isn't available (e.g., in an emulator) or by the tunable. Add checks for minimum and maximum of non_temporal_threshold. This fixes BZ #29953.
* x86_64: State assembler is being tested on sysdeps/x86/configureAdhemerval Zanella2022-12-062-3/+3
|
* configure: Remove AS checkAdhemerval Zanella2022-12-062-3/+3
| | | | | | The assembler is not issued directly, but rather always through CC wrapper. The binutils version check if done with LD instead. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* elf: Remove _dl_string_hwcapJavier Pello2022-10-061-14/+0
| | | | | | | | Removal of legacy hwcaps support from the dynamic loader left no users of _dl_string_hwcap. Signed-off-by: Javier Pello <devel@otheo.eu> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementationsAurelien Jarno2022-10-031-0/+1
| | | | | | | | | | | The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk' instruction which belongs to the BMI1 CPU feature and the 'shrx' instruction, which belongs to the BMI2 CPU feature. Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementationAurelien Jarno2022-10-031-0/+1
| | | | | | | | | | | The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which belongs to the LZCNT CPU feature. Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: include BMI1 and BMI2 in x86-64-v3 levelAurelien Jarno2022-10-031-0/+2
| | | | | | | | The "System V Application Binary Interface AMD64 Architecture Processor Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3 level. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* Update _FloatN header support for C++ in GCC 13Joseph Myers2022-09-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC 13 adds support for _FloatN and _FloatNx types in C++, so breaking the installed glibc headers that assume such support is not present. GCC mostly works around this with fixincludes, but that doesn't help for building glibc and its tests (glibc doesn't itself contain C++ code, but there's C++ code built for tests). Update glibc's bits/floatn-common.h and bits/floatn.h headers to handle the GCC 13 support directly. In general the changes match those made by fixincludes, though I think the ones in sysdeps/powerpc/bits/floatn.h, where the header tests __LDBL_MANT_DIG__ == 113 or uses #elif, wouldn't match the existing fixincludes patterns. Some places involving special C++ handling in relation to _FloatN support are not changed. There's no need to change the __HAVE_FLOATN_NOT_TYPEDEF definition (also in a form that wouldn't be matched by the fixincludes fixes) because it's only used in relation to macro definitions using features not supported for C++ (__builtin_types_compatible_p and _Generic). And there's no need to change the inline function overloads for issignaling, iszero and iscanonical in C++ because cases where types have the same format but are no longer compatible types are handled automatically by the C++ overload resolution rules. This patch also does not change the overload handling for iseqsig, and there I think changes *are* needed, beyond those in this patch or made by fixincludes. The way that overload is defined, via a template parameter to a structure type, requires overloads whenever the types are incompatible, even if they have the same format. So I think we need to add overloads with GCC 13 for every supported _FloatN and _FloatNx type, rather than just having one for _Float128 when it has a different ABI to long double as at present (but for older GCC, such overloads must not be defined for types that end up defined as typedefs for another type). Tested with build-many-glibcs.py: compilers build for aarch64-linux-gnu ia64-linux-gnu mips64-linux-gnu powerpc-linux-gnu powerpc64le-linux-gnu x86_64-linux-gnu; glibcs build for aarch64-linux-gnu ia64-linux-gnu i686-linux-gnu mips-linux-gnu mips64-linux-gnu-n32 powerpc-linux-gnu powerpc64le-linux-gnu x86_64-linux-gnu.
* math: x86: Use prefix for FP_INIT_ROUNDMODEAdhemerval Zanella2022-09-051-1/+7
| | | | | | | Not all compilers support the inline asm prefix '%v' to emit the avx instruction if AVX is enable. Use a prefix instead. Checked on x86_64-linux-gnu and i686-linux-gnu.
* nptl: x86_64: Use same code for CURRENT_STACK_FRAME and stackinfo_get_spAdhemerval Zanella2022-08-311-1/+3
| | | | | | | | | | | | | It avoids the possible warning of uninitialized 'frame' variable when building with clang: ../sysdeps/nptl/jmp-unwind.c:27:42: error: variable 'frame' is uninitialized when used here [-Werror,-Wuninitialized] __pthread_cleanup_upto (env->__jmpbuf, CURRENT_STACK_FRAME); The resulting code is similar to CURRENT_STACK_FRAME. Checked on x86_64-linux-gnu.
* x86: Add support to build strcmp/strlen/strchr with explicit ISA levelNoah Goldstein2022-07-161-0/+10
| | | | | | | | | | | | | | | | | | | | 1. Add default ISA level selection in non-multiarch/rtld implementations. 2. Add ISA level build guards to different implementations. - I.e strcmp-avx2.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (strcmp-evex.S). 3. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Add missing rtm tests for strcmp familyNoah Goldstein2022-07-136-2/+150
| | | | | | | | | | Add new tests for: strcasecmp strncasecmp strcmp wcscmp These functions all have avx2_rtm implementations so should be tested.
* x86: Add support for building {w}memcmp{eq} with explicit ISA levelNoah Goldstein2022-07-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Refactor files so that all implementations are in the multiarch directory - Moved the implementation portion of memcmp sse2 from memcmp.S to multiarch/memcmp-sse2.S - The non-multiarch file now only includes one of the implementations in the multiarch directory based on the compiled ISA level (only used for non-multiarch builds. Otherwise we go through the ifunc selector). 2. Add ISA level build guards to different implementations. - I.e memcmp-avx2-movsb.S which is ISA level 3 will only build if compiled ISA level <= 3. Otherwise there is no reason to include it as we will always use one of the ISA level 4 implementations (memcmp-evex-movbe.S). 3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the non-multiarch {w}memcmp{eq}.S which will in turn select the best implementation based on the compiled ISA level. 4. Refactor the ifunc selector and ifunc implementation list to use the ISA level aware wrapper macros that allow functions below the compiled ISA level (with a guranteed replacement) to be skipped. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.
* x86: Add more feature definitions to isa-level.hNoah Goldstein2022-06-281-0/+15
| | | | | This commit doesn't change anything in itself. It is just to add definitions that will be needed by future patches.
* x86-64: Only define used SSE/AVX/AVX512 run-time resolversH.J. Lu2022-06-271-0/+2
| | | | | | | When glibc is built with x86-64 ISA level v3, SSE run-time resolvers aren't used. For x86-64 ISA level v4 build, both SSE and AVX resolvers are unused. Check the minimum x86-64 ISA level to exclude the unused run-time resolvers.
* x86: Move CPU_FEATURE{S}_{USABLE|ARCH}_P to isa-level.hH.J. Lu2022-06-272-27/+24
| | | | | Move X86_ISA_CPU_FEATURE_USABLE_P and X86_ISA_CPU_FEATURES_ARCH_P to where MINIMUM_X86_ISA_LEVEL and XXX_X86_ISA_LEVEL are defined.
* x86: Fix backwards Prefer_No_VZEROUPPER check in ifunc-evex.hNoah Goldstein2022-06-272-24/+32
| | | | | | | | | Add third argument to X86_ISA_CPU_FEATURES_ARCH_P macro so the runtime CPU_FEATURES_ARCH_P check can be inverted if the MINIMUM_X86_ISA_LEVEL is not high enough to constantly evaluate the check. Use this new macro to correct the backwards check in ifunc-evex.h
* x86: Add defines / utilities for making ISA specific x86 buildsNoah Goldstein2022-06-224-13/+180
| | | | | | | | | | | | | | | 1. Factor out some of the ISA level defines in isa-level.c to standalone header isa-level.h 2. Add new headers with ISA level dependent macros for handling ifuncs. Note, this file does not change any code. Tested with and without multiarch on x86_64 for ISA levels: {generic, x86-64-v2, x86-64-v3, x86-64-v4} And m32 with and without multiarch.