about summary refs log tree commit diff
path: root/sysdeps/x86
Commit message (Collapse)AuthorAgeFilesLines
* x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]H.J. Lu2023-08-291-18/+13
| | | | | | | | | | | The old Intel software developer manual specified that the low byte of EAX of CPUID leaf 2 returned 1 which indicated the number of rounds of CPUDID leaf 2 was needed to retrieve the complete cache information. The newer Intel manual has been changed to that it should always return 1 and be ignored. If the lower byte isn't 1, CPUID leaf 2 can't be used. In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead. If CPUID leaf 4 doesn't contain the cache information, cache information isn't available at all. This addresses BZ #30643.
* x86: Fix incorrect scope of setting `shared_per_thread` [BZ# 30745]Noah Goldstein2023-08-111-4/+3
| | | | | | | | | | | | | | The: ``` if (shared_per_thread > 0 && threads > 0) shared_per_thread /= threads; ``` Code was accidentally moved to inside the else scope. This doesn't match how it was previously (before af992e7abd). This patch fixes that by putting the division after the `else` block.
* x86: Fix for cache computation on AMD legacy cpus.Sajan Karumanchi2023-08-061-27/+199
| | | | | | | | | | | Some legacy AMD CPUs and hypervisors have the _cpuid_ '0x8000_001D' set to Zero, thus resulting in zeroed-out computed cache values. This patch reintroduces the old way of cache computation as a fail-safe option to handle these exceptions. Fixed 'level4_cache_size' value through handle_amd(). Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com> Tested-by: Florian Weimer <fweimer@redhat.com>
* <sys/platform/x86.h>: Add APX supportH.J. Lu2023-07-274-0/+11
| | | | | | | | Add support for Intel Advanced Performance Extensions: https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html to <sys/platform/x86.h>.
* [PATCH v1] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.Noah Goldstein2023-07-181-3/+12
| | | | | | | | | On some machines we end up with incomplete cache information. This can make the new calculation of `sizeof(total-L3)/custom-divisor` end up lower than intended (and lower than the prior value). So reintroduce the old bound as a lower bound to avoid potentially regressing code where we don't have complete information to make the decision. Reviewed-by: DJ Delorie <dj@redhat.com>
* x86: Fix slight bug in `shared_per_thread` cache size calculation.Noah Goldstein2023-07-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | After: ``` commit af992e7abdc9049714da76cae1e5e18bc4838fb8 Author: Noah Goldstein <goldstein.w.n@gmail.com> Date: Wed Jun 7 13:18:01 2023 -0500 x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4` ``` Split `shared` (cumulative cache size) from `shared_per_thread` (cache size per socket), the `shared_per_thread` *can* be slightly off from the previous calculation. Previously we added `core` even if `threads_l2` was invalid, and only used `threads_l2` to divide `core` if it was present. The changed version only included `core` if `threads_l2` was valid. This change restores the old behavior if `threads_l2` is invalid by adding the entire value of `core`. Reviewed-by: DJ Delorie <dj@redhat.com>
* configure: Use autoconf 2.71Siddhesh Poyarekar2023-07-171-46/+52
| | | | | | | | | | | | | | Bump autoconf requirement to 2.71 to allow regenerating configure on more recent distributions. autoconf 2.71 has been in Fedora since F36 and is the current version in Debian stable (bookworm). It appears to be current in Gentoo as well. All sysdeps configure and preconfigure scripts have also been regenerated; all changes are trivial transformations that do not affect functionality. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86: Make dl-cache.h and readelflib.c not Linux-specificSergey Bugaev2023-06-261-0/+90
| | | | | | | These files could be useful to any port that wants to use ld.so.cache. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* Fix misspellings -- BZ 25337Paul Pluzhnikov2023-06-192-2/+2
|
* x86: Make the divisor in setting `non_temporal_threshold` cpu specificNoah Goldstein2023-06-124-26/+51
| | | | | | | | | | | | | | | | Different systems prefer a different divisors. From benchmarks[1] so far the following divisors have been found: ICX : 2 SKX : 2 BWD : 8 For Intel, we are generalizing that BWD and older prefers 8 as a divisor, and SKL and newer prefers 2. This number can be further tuned as benchmarks are run. [1]: https://github.com/goldsteinn/memcpy-nt-benchmarks Reviewed-by: DJ Delorie <dj@redhat.com>
* x86: Refactor Intel `init_cpu_features`Noah Goldstein2023-06-121-81/+309
| | | | | | | | | | | | | | | This patch should have no affect on existing functionality. The current code, which has a single switch for model detection and setting prefered features, is difficult to follow/extend. The cases use magic numbers and many microarchitectures are missing. This makes it difficult to reason about what is implemented so far and/or how/where to add support for new features. This patch splits the model detection and preference setting stages so that CPU preferences can be set based on a complete list of available microarchitectures, rather than based on model magic numbers. Reviewed-by: DJ Delorie <dj@redhat.com>
* x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`Noah Goldstein2023-06-121-27/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current `non_temporal_threshold` set to roughly '3/4 * sizeof_L3 / ncores_per_socket'. This patch updates that value to roughly 'sizeof_L3 / 4` The original value (specifically dividing the `ncores_per_socket`) was done to limit the amount of other threads' data a `memcpy`/`memset` could evict. Dividing by 'ncores_per_socket', however leads to exceedingly low non-temporal thresholds and leads to using non-temporal stores in cases where REP MOVSB is multiple times faster. Furthermore, non-temporal stores are written directly to main memory so using it at a size much smaller than L3 can place soon to be accessed data much further away than it otherwise could be. As well, modern machines are able to detect streaming patterns (especially if REP MOVSB is used) and provide LRU hints to the memory subsystem. This in affect caps the total amount of eviction at 1/cache_associativity, far below meaningfully thrashing the entire cache. As best I can tell, the benchmarks that lead this small threshold where done comparing non-temporal stores versus standard cacheable stores. A better comparison (linked below) is to be REP MOVSB which, on the measure systems, is nearly 2x faster than non-temporal stores at the low-end of the previous threshold, and within 10% for over 100MB copies (well past even the current threshold). In cases with a low number of threads competing for bandwidth, REP MOVSB is ~2x faster up to `sizeof_L3`. The divisor of `4` is a somewhat arbitrary value. From benchmarks it seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs such as Broadwell prefer something closer to `8`. This patch is meant to be followed up by another one to make the divisor cpu-specific, but in the meantime (and for easier backporting), this patch settles on `4` as a middle-ground. Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable stores where done using: https://github.com/goldsteinn/memcpy-nt-benchmarks Sheets results (also available in pdf on the github): https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml Reviewed-by: DJ Delorie <dj@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Fix a few more typos I missed in previous round -- BZ 25337Paul Pluzhnikov2023-06-021-1/+1
|
* Fix misspellings in sysdeps/ -- BZ 25337Paul Pluzhnikov2023-05-303-3/+3
|
* x86: Use 64MB as nt-store threshold if no cacheinfo [BZ #30429]Noah Goldstein2023-05-271-1/+9
| | | | | | | | | | | | If `non_temporal_threshold` is below `minimum_non_temporal_threshold`, it almost certainly means we failed to read the systems cache info. In this case, rather than defaulting the minimum correct value, we should default to a value that gets at least reasonable performance. 64MB is chosen conservatively to be at the very high end. This should never cause non-temporal stores when, if we had read cache info, we wouldn't have otherwise. Reviewed-by: Florian Weimer <fweimer@redhat.com>
* <sys/platform/x86.h>: Add PREFETCHI supportH.J. Lu2023-04-054-0/+7
| | | | | Add PREFETCHI support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AMX-COMPLEX supportH.J. Lu2023-04-054-0/+8
| | | | | Add AMX-COMPLEX support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-NE-CONVERT supportH.J. Lu2023-04-054-0/+8
| | | | | Add AVX-NE-CONVERT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-VNNI-INT8 supportH.J. Lu2023-04-054-0/+17
| | | | | Add AVX-VNNI-INT8 support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add MSRLIST supportH.J. Lu2023-04-052-0/+2
| | | | | Add MSRLIST support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AVX-IFMA supportH.J. Lu2023-04-054-0/+8
| | | | | Add AVX-IFMA support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add AMX-FP16 supportH.J. Lu2023-04-054-0/+8
| | | | | Add AMX-FP16 support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add WRMSRNS supportH.J. Lu2023-04-052-0/+2
| | | | | Add WRMSRNS support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add ArchPerfmonExt supportH.J. Lu2023-04-052-0/+2
| | | | | | Add Architectural Performance Monitoring Extended Leaf (EAX = 23H) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add CMPCCXADD supportH.J. Lu2023-04-054-0/+7
| | | | | Add CMPCCXADD support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LASS supportH.J. Lu2023-04-052-0/+2
| | | | | Add Linear Address Space Separation (LASS) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add RAO-INT supportH.J. Lu2023-04-054-0/+7
| | | | | Add RAO-INT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LBR supportH.J. Lu2023-04-052-1/+2
| | | | | Add architectural LBR support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add RTM_FORCE_ABORT supportH.J. Lu2023-04-052-1/+2
| | | | | Add RTM_FORCE_ABORT support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add SGX-KEYS supportH.J. Lu2023-04-052-1/+2
| | | | | Add SGX-KEYS support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add BUS_LOCK_DETECT supportH.J. Lu2023-04-052-1/+2
| | | | | | Add Bus lock debug exceptions (BUS_LOCK_DETECT) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <sys/platform/x86.h>: Add LA57 supportH.J. Lu2023-04-052-1/+2
| | | | | | Add 57-bit linear addresses and five-level paging (LA57) support to <sys/platform/x86.h>. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* <bits/platform/x86.h>: Rename to x86_cpu_INDEX_7_ECX_15H.J. Lu2023-04-051-1/+1
| | | | | | Rename x86_cpu_INDEX_7_ECX_1 to x86_cpu_INDEX_7_ECX_15 for the unused bit 15 in ECX from CPUID with EAX == 0x7 and ECX == 0. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86/dl-cacheinfo: remove unsused parameter from handle_amdAndreas Schwab2023-04-041-36/+30
| | | | Also replace an unreachable assert with __builtin_unreachable.
* x86: Set FSGSBASE to active if enabled by kernelH.J. Lu2023-04-033-0/+31
| | | | | | | | | | | | | Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are enabled. If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE instructions can be used in user space. Define dl_check_hwcap2 to set the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is set. Add a test to verify that FSGSBASE is active on current kernels. NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo. Reviewed-by: Florian Weimer <fweimer@redhat.com>
* Remove --enable-tunables configure optionAdhemerval Zanella Netto2023-03-295-68/+29
| | | | | | | | | | | | And make always supported. The configure option was added on glibc 2.25 and some features require it (such as hwcap mask, huge pages support, and lock elisition tuning). It also simplifies the build permutations. Changes from v1: * Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs more discussion. * Cleanup more code. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* x86: Don't check PREFETCHWT1 in tst-cpu-features-cpuinfo.cDJ Delorie2023-03-211-0/+3
| | | | | | | Don't check PREFETCHWT1 against /proc/cpuinfo since kernel doesn't report PREFETCHWT1 in /proc/cpuinfo. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: Fix bug about glibc.cpu.hwcaps.caiyinyu2023-03-071-3/+3
| | | | | | | | | | | | | | | | | | | Recorded in [BZ #30183]: 1. export GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX512 2. Add _dl_printf("p -- %s\n", p); just before switch(nl) in sysdeps/x86/cpu-tunables.c 3. compiled and run ./testrun.sh /usr/bin/ls you will get: p -- -AVX512 p -- LC_ADDRESS=en_US.UTF-8 p -- LC_NUMERIC=C ... The function, TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp), checks far more than it should and it should stop at end of "-AVX512".
* x86-64: Add glibc.cpu.prefer_map_32bit_exec [BZ #28656]H.J. Lu2023-02-222-0/+16
| | | | | | | | | | | | | | | | | | | Crossing 2GB boundaries with indirect calls and jumps can use more branch prediction resources on Intel Golden Cove CPU (see the "Misprediction for Branches >2GB" section in Intel 64 and IA-32 Architectures Optimization Reference Manual.) There is visible performance improvement on workloads with many PLT calls when executable and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable or denywrite pages in shared libraries with MAP_32BIT first. NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR), which is always disabled for SUID programs and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec, or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only between shared libraries or between shared libraries and executables with addresses below 2GB. PIEs are usually loaded at a random address above 4GB by the kernel.
* string: Remove string_private.hAdhemerval Zanella2023-02-171-20/+0
| | | | | | Now that _STRING_ARCH_unaligned is not used anymore. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
* htl: Generalize i386 pt-machdep.h to x86Samuel Thibault2023-02-121-0/+28
|
* x86: Cache computation for AMD architecture.Sajan Karumanchi2023-01-181-159/+45
| | | | | | | | All AMD architectures cache details will be computed based on __cpuid__ `0x8000_001D` and the reference to __cpuid__ `0x8000_0006` will be zeroed out for future architectures. Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
* Update copyright dates with scripts/update-copyrightsJoseph Myers2023-01-06121-121/+121
|
* x86: Check minimum/maximum of non_temporal_threshold [BZ #29953]H.J. Lu2023-01-031-9/+16
| | | | | | | | | The minimum non_temporal_threshold is 0x4040. non_temporal_threshold may be set to less than the minimum value when the shared cache size isn't available (e.g., in an emulator) or by the tunable. Add checks for minimum and maximum of non_temporal_threshold. This fixes BZ #29953.
* x86_64: State assembler is being tested on sysdeps/x86/configureAdhemerval Zanella2022-12-062-3/+3
|
* configure: Remove AS checkAdhemerval Zanella2022-12-062-3/+3
| | | | | | The assembler is not issued directly, but rather always through CC wrapper. The binutils version check if done with LD instead. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* elf: Remove _dl_string_hwcapJavier Pello2022-10-061-14/+0
| | | | | | | | Removal of legacy hwcaps support from the dynamic loader left no users of _dl_string_hwcap. Signed-off-by: Javier Pello <devel@otheo.eu> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementationsAurelien Jarno2022-10-031-0/+1
| | | | | | | | | | | The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk' instruction which belongs to the BMI1 CPU feature and the 'shrx' instruction, which belongs to the BMI2 CPU feature. Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementationAurelien Jarno2022-10-031-0/+1
| | | | | | | | | | | The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which belongs to the LZCNT CPU feature. Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S") Partially resolves: BZ #29611 Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* x86: include BMI1 and BMI2 in x86-64-v3 levelAurelien Jarno2022-10-031-0/+2
| | | | | | | | The "System V Application Binary Interface AMD64 Architecture Processor Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3 level. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>