| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
* sysdeps/x86/cpu-features.c (init_cpu_features): Use
index_cpu_RTM and reg_RTM to clear the bit_cpu_RTM bit.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Patch disables Intel TSX on some Haswell processors to avoid TSX
on kernels that weren't updated with the latest microcode package
(which disables broken feature by default).
* sysdeps/x86/cpu-features.c (get_common_indeces): Add
stepping identification.
(init_cpu_features): Add handle of Haswell.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the Intel Architecture Instruction Set Extensions Programming
reference the recommended way to test for FMA in section
'2.2.1 Detection of FMA' is:
"Application Software must identify that hardware supports AVX as
explained in ... after that it must also detect support for FMA..."
We don't do that in glibc. We use osxsave to detect the use of xgetbv,
and after that we check for AVX and FMA orthogonally. It is conceivable
that you could have the AVX bit clear and the FMA bit in an undefined
state.
This commit fixes FMA and AVX2 detection to depend on usable AVX
as required by the recommended Intel sequences.
v1: https://www.sourceware.org/ml/libc-alpha/2016-10/msg00241.html
v2: https://www.sourceware.org/ml/libc-alpha/2016-10/msg00265.html
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is transition penalty when SSE instructions are mixed with 256-bit
AVX or 512-bit AVX512 load instructions. Since _dl_runtime_resolve_avx
and _dl_runtime_profile_avx512 save/restore 256-bit YMM/512-bit ZMM
registers, there is transition penalty when SSE instructions are used
with lazy binding on AVX and AVX512 processors.
To avoid SSE transition penalty, if only the lower 128 bits of the first
8 vector registers are non-zero, we can preserve %xmm0 - %xmm7 registers
with the zero upper bits.
For AVX and AVX512 processors which support XGETBV with ECX == 1, we can
use XGETBV with ECX == 1 to check if the upper 128 bits of YMM registers
or the upper 256 bits of ZMM registers are zero. We can restore only the
non-zero portion of vector registers with AVX/AVX512 load instructions
which will zero-extend upper bits of vector registers.
This patch adds _dl_runtime_resolve_sse_vex which saves and restores
XMM registers with 128-bit AVX store/load instructions. It is used to
preserve YMM/ZMM registers when only the lower 128 bits are non-zero.
_dl_runtime_resolve_avx_opt and _dl_runtime_resolve_avx512_opt are added
and used on AVX/AVX512 processors supporting XGETBV with ECX == 1 so
that we store and load only the non-zero portion of vector registers.
This avoids SSE transition penalty caused by _dl_runtime_resolve_avx and
_dl_runtime_profile_avx512 when only the lower 128 bits of vector
registers are used.
_dl_runtime_resolve_avx_slow is added and used for AVX processors which
don't support XGETBV with ECX == 1. Since there is no SSE transition
penalty on AVX512 processors which don't support XGETBV with ECX == 1,
_dl_runtime_resolve_avx512_slow isn't provided.
[BZ #20495]
[BZ #20508]
* sysdeps/x86/cpu-features.c (init_cpu_features): For Intel
processors, set Use_dl_runtime_resolve_slow and set
Use_dl_runtime_resolve_opt if XGETBV suports ECX == 1.
* sysdeps/x86/cpu-features.h (bit_arch_Use_dl_runtime_resolve_opt):
New.
(bit_arch_Use_dl_runtime_resolve_slow): Likewise.
(index_arch_Use_dl_runtime_resolve_opt): Likewise.
(index_arch_Use_dl_runtime_resolve_slow): Likewise.
* sysdeps/x86_64/dl-machine.h (elf_machine_runtime_setup): Use
_dl_runtime_resolve_avx512_opt and _dl_runtime_resolve_avx_opt
if Use_dl_runtime_resolve_opt is set. Use
_dl_runtime_resolve_slow if Use_dl_runtime_resolve_slow is set.
* sysdeps/x86_64/dl-trampoline.S: Include <cpu-features.h>.
(_dl_runtime_resolve_opt): New. Defined for AVX and AVX512.
(_dl_runtime_resolve): Add one for _dl_runtime_resolve_sse_vex.
* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_avx_slow):
New.
(_dl_runtime_resolve_opt): Likewise.
(_dl_runtime_profile): Define only if _dl_runtime_profile is
defined.
|
|
|
|
|
|
|
|
|
|
|
| |
Since the FMA4 bit is in COMMON_CPUID_INDEX_80000001 and FMA4 requires
AVX, determine if FMA4 is usable after COMMON_CPUID_INDEX_80000001 is
available and if AVX is usable.
[BZ #20195]
* sysdeps/x86/cpu-features.c (get_common_indeces): Move FMA4
check to ...
(init_cpu_features): Here.
|
|
|
|
|
|
|
|
|
| |
Updated from the model numbers of Goldmont and Airmont processors in
Intel64 And IA-32 Processor Architectures Software Developer's Manual
Volume 3 Revision 058.
* sysdeps/x86/cpu-features.c (init_cpu_features): Detect Intel
Goldmont and Airmont processors.
|
|
|
|
|
|
|
|
|
| |
Intel Core i3, i5 and i7 processors have fast unaligned copy and
copy backward is ignored. Remove Fast_Copy_Backward from Intel Core
processors to avoid confusion.
* sysdeps/x86/cpu-features.c (init_cpu_features): Don't set
bit_arch_Fast_Copy_Backward for Intel Core proessors.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load. A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.
[BZ #19583]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
processors. Set Fast_Copy_Backward for AMD Excavator
processors.
* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
New.
(index_arch_Fast_Unaligned_Copy): Likewise.
* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since only Intel processors with AVX2 have fast unaligned load, we
should set index_arch_AVX_Fast_Unaligned_Load only for Intel processors.
Move AVX, AVX2, AVX512, FMA and FMA4 detection into get_common_indeces
and call get_common_indeces for other processors.
Add CPU_FEATURES_CPU_P and CPU_FEATURES_ARCH_P to aoid loading
GLRO(dl_x86_cpu_features) in cpu-features.c.
[BZ #19583]
* sysdeps/x86/cpu-features.c (get_common_indeces): Remove
inline. Check family before setting family, model and
extended_model. Set AVX, AVX2, AVX512, FMA and FMA4 usable
bits here.
(init_cpu_features): Replace HAS_CPU_FEATURE and
HAS_ARCH_FEATURE with CPU_FEATURES_CPU_P and
CPU_FEATURES_ARCH_P. Set index_arch_AVX_Fast_Unaligned_Load
for Intel processors with usable AVX2. Call get_common_indeces
for other processors with family == NULL.
* sysdeps/x86/cpu-features.h (CPU_FEATURES_CPU_P): New macro.
(CPU_FEATURES_ARCH_P): Likewise.
(HAS_CPU_FEATURE): Use CPU_FEATURES_CPU_P.
(HAS_ARCH_FEATURE): Use CPU_FEATURES_ARCH_P.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
index_* and bit_* macros are used to access cpuid and feature arrays o
struct cpu_features. It is very easy to use bits and indices of cpuid
array on feature array, especially in assembly codes. For example,
sysdeps/i386/i686/multiarch/bcopy.S has
HAS_CPU_FEATURE (Fast_Rep_String)
which should be
HAS_ARCH_FEATURE (Fast_Rep_String)
We change index_* and bit_* to index_cpu_*/index_arch_* and
bit_cpu_*/bit_arch_* so that we can catch such error at build time.
[BZ #19762]
* sysdeps/unix/sysv/linux/x86_64/64/dl-librecon.h
(EXTRA_LD_ENVVARS): Add _arch_ to index_*/bit_*.
* sysdeps/x86/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.h (bit_*): Renamed to ...
(bit_arch_*): This for feature array.
(bit_*): Renamed to ...
(bit_cpu_*): This for cpu array.
(index_*): Renamed to ...
(index_arch_*): This for feature array.
(index_*): Renamed to ...
(index_cpu_*): This for cpu array.
[__ASSEMBLER__] (HAS_FEATURE): Add and use field.
[__ASSEMBLER__] (HAS_CPU_FEATURE)): Pass cpu to HAS_FEATURE.
[__ASSEMBLER__] (HAS_ARCH_FEATURE)): Pass arch to HAS_FEATURE.
[!__ASSEMBLER__] (HAS_CPU_FEATURE): Replace index_##name and
bit_##name with index_cpu_##name and bit_cpu_##name.
[!__ASSEMBLER__] (HAS_ARCH_FEATURE): Replace index_##name and
bit_##name with index_arch_##name and bit_arch_##name.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GLIBC benchtest testcases shows SSE2_Unaligned based implementations
are performing faster compare to SSE2 based implementations for
routines: strcmp, strcat, strncat, stpcpy, stpncpy, strcpy, strncpy
and strstr. Flag index_Fast_Unaligned_Load is set for Excavator family
0x15h CPU's. This makes SSE2_Unaligned based implementations as
default for these routines.
[BZ #19467]
* sysdeps/x86/cpu-features.c (init_cpu_features): Set
index_Fast_Unaligned_Load flag for Excavator family CPUs.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It shows improvement up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).
* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
* sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
* sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
index_Prefer_No_VZEROUPPER): New.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set the
Prefer_No_VZEROUPPER for Knights Landing.
|
|
|
|
|
|
|
|
| |
Knights Landing processor is based on Silvermont. This patch enables
Silvermont optimizations for Knights Landing.
* sysdeps/x86/cpu-features.c (init_cpu_features): Enable
Silvermont optimizations for Knights Landing.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AMD CPUs uses the similar encoding scheme for extended family and model
as Intel CPUs as shown in:
http://support.amd.com/TechDocs/25481.pdf
This patch updates get_common_indeces to get family and model for both
Intel and AMD CPUs when family == 0x0f.
[BZ #19214]
* sysdeps/x86/cpu-features.c (get_common_indeces): Add an
argument to return extended model. Update family and model
with extended family and model when family == 0x0f.
(init_cpu_features): Updated.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We detect i586 and i686 features at run-time by checking CX8 and CMOV
CPUID features bits. We can use these information to select the best
implementation in ix86 multiarch. HAS_I586/HAS_I686 is true if i586/i686
instructions are available on the processor.
Due to the reordering and the other nifty extensions in i686, it is not
really good to use heavily i586 optimized code on an i686. It's better
to use i486 code if it isn't an i586. USE_I586/USE_I686 is true if
i586/i686 implementation should be used for the processor. USE_I586
is true only if i686 instructions aren't available. If i686 instructions
are available, we always choose i686 or i486 implementation, in that order,
and we never choose i586 implementation for i686-class processors.
* sysdeps/i386/init-arch.h: New file.
* sysdeps/i386/i586/init-arch.h: Likewise.
* sysdeps/i386/i686/init-arch.h: Likewise.
* sysdeps/x86/cpu-features.c (init_cpu_features): Set bit_I586
bit if CX8 is available. Set bit_I686 bit if CMOV is available.
* sysdeps/x86/cpu-features.h (bit_I586): New.
(bit_I686): Likewise.
(bit_CX8): Likewise.
(bit_CMOV): Likewise.
(index_CX8): Likewise.
(index_CMOV): Likewise.
(index_I586): Likewise.
(index_I686): Likewise.
(reg_CX8): Likewise.
(reg_CMOV): Likewise.
(HAS_I586): Defined as HAS_ARCH_FEATURE (I586) if i586 isn't
available at compile-time.
(HAS_I686): Defined as HAS_ARCH_FEATURE (I686) if i686 isn't
available at compile-time.
* sysdeps/x86/init-arch.h (USE_I586): New macro.
(USE_I686): Likewise.
|
|
|
|
|
|
|
|
|
|
|
|
| |
cpuid, i586 and i686 instructions are available if the processor
specified by -march= supports them. We can use this information
to determine whether those instructions can be used safely.
* sysdeps/x86/cpu-features.c (init_cpu_features): Check
whether cpuid is available only if HAS_CPUID is 0.
* sysdeps/x86/cpu-features.h (HAS_CPUID): New.
(HAS_I586): Likewise.
(HAS_I686): Likewise.
|
|
|
|
|
|
|
|
|
| |
Since not all i486 processors support cpuid, we call __get_cpuid_max to
check if cpuid is available before using it if not compiling for i586,
i686 nor x86-64.
* sysdeps/x86/cpu-features.c (init_cpu_features): Call
__get_cpuid_max if not compiling for i586, i686 nor x86-64.
|
|
This patch adds _dl_x86_cpu_features to rtld_global in x86 ld.so
and initializes it early before __libc_start_main is called so that
cpu_features is always available when it is used and we can avoid
calling __init_cpu_features in IFUNC selectors.
* sysdeps/i386/dl-machine.h: Include <cpu-features.c>.
(dl_platform_init): Call init_cpu_features.
* sysdeps/i386/dl-procinfo.c (_dl_x86_cpu_features): New.
* sysdeps/i386/i686/cacheinfo.c
(DISABLE_PREFERRED_MEMORY_INSTRUCTION): Removed.
* sysdeps/i386/i686/multiarch/Makefile (aux): Remove init-arch.
* sysdeps/i386/i686/multiarch/Versions: Removed.
* sysdeps/i386/i686/multiarch/ifunc-defines.sym (KIND_OFFSET):
Removed.
* sysdeps/i386/ldsodefs.h: Include <cpu-features.h>.
* sysdeps/unix/sysv/linux/x86/Makefile
(libpthread-sysdep_routines): Remove init-arch.
* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: Include
<sysdeps/x86_64/dl-procinfo.c> instead of
sysdeps/generic/dl-procinfo.c>.
* sysdeps/x86/Makefile [$(subdir) == csu] (gen-as-const-headers):
Add cpu-features-offsets.sym and rtld-global-offsets.sym.
[$(subdir) == elf] (sysdep-dl-routines): Add dl-get-cpu-features.
[$(subdir) == elf] (tests): Add tst-get-cpu-features.
[$(subdir) == elf] (tests-static): Add
tst-get-cpu-features-static.
* sysdeps/x86/Versions: New file.
* sysdeps/x86/cpu-features-offsets.sym: Likewise.
* sysdeps/x86/cpu-features.c: Likewise.
* sysdeps/x86/cpu-features.h: Likewise.
* sysdeps/x86/dl-get-cpu-features.c: Likewise.
* sysdeps/x86/libc-start.c: Likewise.
* sysdeps/x86/rtld-global-offsets.sym: Likewise.
* sysdeps/x86/tst-get-cpu-features-static.c: Likewise.
* sysdeps/x86/tst-get-cpu-features.c: Likewise.
* sysdeps/x86_64/dl-procinfo.c: Likewise.
* sysdeps/x86_64/cacheinfo.c (__cpuid_count): Removed.
Assume USE_MULTIARCH is defined and don't check it.
(is_intel): Replace __cpu_features with GLRO(dl_x86_cpu_features).
(is_amd): Likewise.
(max_cpuid): Likewise.
(intel_check_word): Likewise.
(__cache_sysconf): Don't call __init_cpu_features.
(__x86_preferred_memory_instruction): Removed.
(init_cacheinfo): Don't call __init_cpu_features. Replace
__cpu_features with GLRO(dl_x86_cpu_features).
* sysdeps/x86_64/dl-machine.h: <cpu-features.c>.
(dl_platform_init): Call init_cpu_features.
* sysdeps/x86_64/ldsodefs.h: Include <cpu-features.h>.
* sysdeps/x86_64/multiarch/Makefile (aux): Remove init-arch.
* sysdeps/x86_64/multiarch/Versions: Removed.
* sysdeps/x86_64/multiarch/cacheinfo.c: Likewise.
* sysdeps/x86_64/multiarch/init-arch.c: Likewise.
* sysdeps/x86_64/multiarch/ifunc-defines.sym (KIND_OFFSET):
Removed.
* sysdeps/x86_64/multiarch/init-arch.h: Rewrite.
|