| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Was missing to for the multiarch build rtld-strncmp-sse4_2.os was
being built and exporting symbols:
build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42
Introduced in:
commit 11ffcacb64a939c10cfc713746b8ec88837f5c4a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 21 12:10:50 2017 -0700
x86-64: Implement strcmp family IFUNC selectors in C
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Was missing to for the multiarch build rtld-strcspn-sse4.os was
being built and exporting symbols:
build/glibc/string/rtld-strcspn-sse4.os:
U ___m128i_shift_right
U __strcspn_generic
0000000000000000 T __strcspn_sse42
U strlen
build/glibc/string/rtld-varshift.os:
0000000000000000 R ___m128i_shift_right
Introduced in:
commit 06e51c8f3de38761f8855700841bc49cf495c8c0
Author: H.J. Lu <hongjiu.lu@intel.com>
Date: Fri Jul 3 02:48:56 2009 -0700
Add SSE4.2 support for strcspn, strpbrk, and strspn on x86-64.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Was missing to for the multiarch build rtld-memmove-ssse3.os was
being built and exporting symbols:
>$ nm string/rtld-memmove-ssse3.os
U __GI___chk_fail
0000000000000020 T __memcpy_chk_ssse3
0000000000000040 T __memcpy_ssse3
0000000000000020 T __memmove_chk_ssse3
0000000000000040 T __memmove_ssse3
0000000000000000 T __mempcpy_chk_ssse3
0000000000000010 T __mempcpy_ssse3
U __x86_shared_cache_size_half
Introduced after 2.35 in:
commit 26b2478322db94edc9e0e8f577b2f71d291e5acb
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Thu Apr 14 11:47:40 2022 -0500
x86: Reduce code size of mem{move|pcpy|cpy}-ssse3
|
|
|
|
|
|
|
| |
Properly indent X86_IFUNC_IMPL_ADD_VN arguments for memchr, rawmemchr
and wmemchr.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Remove sse2 instructions when using the avx512 or avx version.
2. Fixup some format nits in how the address offsets where aligned.
3. Use more space efficient instructions in the conditional AVX
restoral.
- vpcmpeqq -> vpcmpeqb
- cmp imm32, r; jz -> inc r; jz
4. Use `rep movsb` instead of `rep movsq`. The former is guranteed to
be fast with the ERMS flags, the latter is not. The latter also
wastes an instruction in size setup.
|
|
|
|
|
|
| |
The primary memmove_{impl}_unaligned_erms implementations don't
interact with this function. Putting them in same file both
wastes space and unnecessarily bloats a hot code section.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implementation wise:
1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
use the L(stosb) label that was previously defined.
2. Don't give the hotpath (fallthrough) to zero size.
Code positioning wise:
Move memset_{chk}_erms to its own file. Leaving it in between the
memset_{impl}_unaligned both adds unnecessary complexity to the
file and wastes space in a relatively hot cache section.
|
|
|
|
| |
This was simply missing and meant we weren't testing it properly.
|
|
|
|
|
|
|
| |
When glibc is built with x86-64 ISA level v3, SSE run-time resolvers
aren't used. For x86-64 ISA level v4 build, both SSE and AVX resolvers
are unused. Check the minimum x86-64 ISA level to exclude the unused
run-time resolvers.
|
|
|
|
|
|
|
|
|
| |
Add third argument to X86_ISA_CPU_FEATURES_ARCH_P macro so the runtime
CPU_FEATURES_ARCH_P check can be inverted if the
MINIMUM_X86_ISA_LEVEL is not high enough to constantly evaluate
the check.
Use this new macro to correct the backwards check in ifunc-evex.h
|
|
|
|
| |
This is in accordance with other files in the multiarch directory.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The memcmp-sse4 was removed in:
commit 7cbc03d03091d5664060924789afe46d30a5477e
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Fri Apr 15 12:28:00 2022 -0500
x86: Remove memcmp-sse4.S
so this file does nothing.
|
|
|
|
|
| |
Previously was missing but the two implementations shouldn't get in
the sse2 (generic) text section.
|
|
|
|
|
|
|
|
|
| |
The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.
As well different code boths where explicitly written to touch the
minimum number of cache line i.e sizes <= 32 touch only the entry
cache line.
|
|
|
|
|
|
|
|
|
|
| |
The sanity tests where meant to ensure that the default implementation
was only being built without multiarch with the exception of the
multiarch/rtld-*.S files.
The code used IS_IN (rtld) to check if the build for was for an
multiarch/rtld-*.S file which is incorrect as IS_IN (rtld) is set for
the non-multiarch build as well.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of these don't really matter as there was no dirty upper state
but we should generally avoid stray sse when its not needed.
The one case that really matters is in svml_d_tanh4_core_avx2.S:
blendvps %xmm0, %xmm8, %xmm7
When there was a dirty upper state.
Tested on x86_64-linux
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Refactor files so that all implementations for in the multiarch
directory.
- Essentially moved sse2 {raw|w}memchr.S implementation to
multiarch/{raw|w}memchr-sse2.S
- The non-multiarch {raw|w}memchr.S file now only includes one of
the implementations in the multiarch directory based on the
compiled ISA level (only used for non-multiarch builds.
Otherwise we go through the ifunc selector).
2. Add ISA level build guards to different implementations.
- I.e memchr-avx2.S which is ISA level 3 will only build if
compiled ISA level <= 3. Otherwise there is no reason to include
it as we will always use one of the ISA level 4
implementations (memchr-evex{-rtm}.S).
3. Add new multiarch/rtld-{raw}memchr.S that just include the
non-multiarch {raw}memchr.S which will in turn select the best
implementation based on the compiled ISA level.
4. Refactor the ifunc selector and ifunc implementation list to use
the ISA level aware wrapper macros that allow functions below the
compiled ISA level (with a guranteed replacement) to be skipped.
- Guranteed replacement essentially means that for any ISA level
build there must be a function that the baseline of the ISA
supports. So for {raw|w}memchr.S since there is not ISA level 2
function, the ISA level 2 build still includes the ISA level
1 (sse2) function. Once we reach the ISA level 3 build, however,
{raw|w}memchr-avx2{-rtm}.S will always be sufficient so the ISA
level 1 implementation ({raw|w}memchr-sse2.S) will not be built.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Factor out some of the ISA level defines in isa-level.c to
standalone header isa-level.h
2. Add new headers with ISA level dependent macros for handling
ifuncs.
Note, this file does not change any code.
Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}
And m32 with and without multiarch.
|
|
|
|
|
|
|
|
|
|
| |
No functions are changed. It just renames generic implementations from
'{func}_sse2' to '{func}_generic'. This is just because the postfix
"_sse2" was overloaded and was used for files that had hand-optimized
sse2 assembly implementations and files that just redirected back
to the generic implementation.
Full xcheck passed on x86_64.
|
|
|
|
|
|
|
|
|
|
| |
The RTLD_BOOTSTRAP branch is used to relocate ld.so itself. It only
needs to handle RELATIVE, GLOB_DAT, and JUMP_SLOT. RELATIVE has been
handled (by _ELF_DYNAMIC_DO_RELOC due to DT_RELACOUNT, or RELR), so the
switch statement only needs to handle GLOB_DAT and JUMP_SLOT.
We can drop these `#if[n]def RTLD_BOOTSTRAP` and add a large
`# ifndef RTLD_BOOTSTRAP` instead.
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If an executable has copy relocations for extern protected data, that
can only work if the library containing the definition is built with
assumptions (a) the compiler emits GOT-generating relocations (b) the
linker produces R_*_GLOB_DAT instead of R_*_RELATIVE. Otherwise the
library uses its own definition directly and the executable accesses a
stale copy. Note: the GOT relocations defeat the purpose of protected
visibility as an optimization, but allow rtld to make the executable and
library use the same copy when copy relocations are present, but it
turns out this never worked perfectly.
ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA has strange semantics when both
a.so and b.so define protected var and the executable copy relocates
var: b.so accesses its own copy even with GLOB_DAT. The behavior change
is from commit 62da1e3b00b51383ffa7efc89d8addda0502e107 (x86) and then
copied to nios2 (ae5eae7cfc9c4a8297ff82ec6b794faca1976ecc) and arc
(0e7d930c4c11de896fe807f67fa1eb756c9c1e05).
Without ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA, b.so accesses the copy
relocated data like a.so.
There is now a warning for copy relocation on protected symbol since
commit 7374c02b683b7110b853a32496a619410364d70b. It's extremely
unlikely anyone relies on the ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA
behavior, so let's remove it: this removes a check in the symbol lookup
code.
|
|
|
|
|
|
|
|
|
| |
This has been missing since the the ifuncs where added.
The performance of SSE4.2 is preferable to to SSE2.
Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.
Passes buildmanyglibc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Optimizations are:
1. Reduce code size (-112 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-4k+ rodata is shared with avx2).
Result is roughly a 15-16% speedup:
Function, New Time, Old Time, New / Old
_ZGVbN4v_tanhf, 3.158, 3.749, 0.842
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Optimizations are:
1. Reduce code size (-81 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata size (-32 bytes).
Result is roughly a 17-18% speedup:
Function, New Time, Old Time, New / Old
_ZGVdN8v_tanhf, 1.977, 2.402, 0.823
|
|
|
|
|
|
|
|
|
|
| |
tanhf-avx2 and tanhf-sse4 use the same data tables so we can save
over 4kb using a shared datatable. This does increase the memory
footprint of the sse4 version (as now all the targets are 32 bytes
instead of 16), generally it seems worth the code size save.
NB: This patch doesn't do anything itself, it is setup for future
patches.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Optimizations are:
1. Reduce code size (-67 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata usage (-448 bytes).
Result is roughly a 14% speedup:
Function, New Time, Old Time, New / Old
_ZGVeN16v_tanhf, 0.649, 0.752, 0.863
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Improvements are:
1. Reduce code size (-62 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Reduce rodata usage (-16 bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVbN4v_atanhf, 8.821, 8.903, 0.991
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Improvements are:
1. Reduce code size (-60 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Prefer registers which get short instruction encoding.
5. Shrink rodata usage (-32 bytes).
The throughput improvement is not that significant (3-5%) as the
port 0 bottleneck is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVdN8v_atanhf, 2.799, 2.923, 0.958
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Improvements are:
1. Reduce code size (-64 bytes).
2. Remove redundant move instructions.
3. Slightly improve instruction selection/scheduling where
possible.
4. Reduce rodata size ([-128, -188] bytes).
The throughput improvement is not significant as the port 0 bottleneck
is unavoidable.
Function, New Time, Old Time, New / Old
_ZGVeN16v_atanhf, 1.39, 1.408, 0.987
|
|
|
|
| |
This ensures the load will never split a cache line.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jun 6 21:11:33 2022 -0700
x86: Shrink code size of memchr-avx2.S
Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data start started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.
This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.
Test cases added and all pass with the new code (and where confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
| |
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.
Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 64 bytes
There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 59 bytes
There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller lengths more.
2. optimizes target placement more carefully.
3. reuses logic more.
4. fixes up various inefficiencies in the logic.
The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874
Regressions:
1. The page cross case is now colder, especially re-entry from the
page cross case if a match is not found in the first VEC
(roughly 50%). My general opinion with this patch is this is
acceptable given the "coldness" of this case (less than 4%) and
generally performance improvement in the other far more common
cases.
2. There are some regressions 5-15% for medium/large user-arg
lengths that have a match in the first VEC. This is because the
logic was rewritten to optimize finds in the first VEC if the
user-arg length is shorter (where we see roughly 20-50%
performance improvements). It is not always the case this is a
regression. My intuition is some frontend quirk is partially
explaining the data although I haven't been able to find the
root cause.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.
This can be useful when hoisting a vzeroupper to save code size
for example:
```
L(foo):
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
VZEROUPPER_RETURN
L(bar):
xorl %eax, %eax
VZEROUPPER_RETURN
```
Can become:
```
L(foo):
COND_VZEROUPPER
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
ret
L(bar):
xorl %eax, %eax
ret
```
This code does not change any existing functionality.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.
(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.
(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
Geometric mean of all benchmarks: new / old = 0.66
Difficult skiptable(0) : new / old = 0.02
Difficult skiptable(1) : new / old = 0.01
Difficult 2-way : new / old = 0.25
Difficult testing first 2 : new / old = 1.26
Difficult skiptable(0) : new / old = 0.05
Difficult skiptable(1) : new / old = 0.06
Difficult 2-way : new / old = 0.26
Difficult testing first 2 : new / old = 1.05
Difficult skiptable(0) : new / old = 0.42
Difficult skiptable(1) : new / old = 0.24
Difficult 2-way : new / old = 0.21
Difficult testing first 2 : new / old = 1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
|
|
|
|
|
| |
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
|
|
|
|
|
|
|
|
|
|
|
| |
Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
|
|
|
|
|
|
| |
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.
Reviewed-by: Fangrui Song <maskray@google.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.
Placeholder function, not used by any processor at the moment.
- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both float, double, and _Float128 are assumed to be supported
(float and double already only uses builtins). Only long double
is parametrized due GCC bug 29253 which prevents its usage on
powerpc.
It allows to remove i686, ia64, x86_64, powerpc, and sparc arch
specific implementation.
On ia64 it also fixes the sNAN handling:
math/test-float64x-fabs
math/test-ldouble-fabs
Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.
GCC also replaces bcopy with a memmove and bzero with memset on default
configuration (to actually get a bzero libc call the code requires
to omit string.h inclusion and built with -fno-builtin), so it is
highly unlikely programs are actually calling libc bzero symbol.
On a recent Linux distro (Ubuntu 22.04), there is no bzero calls
by the installed binaries.
$ cat count_bstring.sh
#!/bin/bash
files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
total=0
for file in $files; do
symbols=`objdump -R $file 2>&1`
if [ $? -eq 0 ]; then
ncalls=`echo $symbols | grep -w $1 | wc -l`
((total=total+ncalls))
if [ $ncalls -gt 0 ]; then
echo "$file: $ncalls"
fi
fi
done
echo "TOTAL=$total"
$ ./count_bstring.sh bzero
TOTAL=0
Checked on x86_64-linux-gnu.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When mutiple threads waiting for lock at the same time, once lock owner
releases the lock, waiters will see lock available and all try to lock,
which may cause an expensive CAS storm.
Binary exponential backoff with random jitter is introduced. As try-lock
attempt increases, there is more likely that a larger number threads
compete for adaptive mutex lock, so increase wait time in exponential.
A random jitter is also added to avoid synchronous try-lock from other
threads.
v2: Remove read-check before try-lock for performance.
v3:
1. Restore read-check since it works well in some platform.
2. Make backoff arch dependent, and enable it for x86_64.
3. Limit max backoff to reduce latency in large critical section.
v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
v5: Commit log updated for regression in large critical section.
Result of pthread-mutex-locks bench
Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better
non-critical-length: 1
1 2 4 8 16 32 64 112 140
0 0.99 0.58 0.52 0.49 0.43 0.44 0.46 0.52 0.54
1 0.98 0.43 0.56 0.50 0.44 0.45 0.50 0.56 0.57
2 0.99 0.41 0.57 0.51 0.45 0.47 0.48 0.60 0.61
4 0.99 0.45 0.59 0.53 0.48 0.49 0.52 0.64 0.65
8 1.00 0.66 0.71 0.63 0.56 0.59 0.66 0.72 0.71
16 0.97 0.78 0.91 0.73 0.67 0.70 0.79 0.80 0.80
32 0.95 1.17 0.98 0.87 0.82 0.86 0.89 0.90 0.90
64 0.96 0.95 1.01 1.01 0.98 1.00 1.03 0.99 0.99
128 0.99 1.01 1.01 1.17 1.08 1.12 1.02 0.97 1.02
non-critical-length: 32
1 2 4 8 16 32 64 112 140
0 1.03 0.97 0.75 0.65 0.58 0.58 0.56 0.70 0.70
1 0.94 0.95 0.76 0.65 0.58 0.58 0.61 0.71 0.72
2 0.97 0.96 0.77 0.66 0.58 0.59 0.62 0.74 0.74
4 0.99 0.96 0.78 0.66 0.60 0.61 0.66 0.76 0.77
8 0.99 0.99 0.84 0.70 0.64 0.66 0.71 0.80 0.80
16 0.98 0.97 0.95 0.76 0.70 0.73 0.81 0.85 0.84
32 1.04 1.12 1.04 0.89 0.82 0.86 0.93 0.91 0.91
64 0.99 1.15 1.07 1.00 0.99 1.01 1.05 0.99 0.99
128 1.00 1.21 1.20 1.22 1.25 1.31 1.12 1.10 0.99
non-critical-length: 128
1 2 4 8 16 32 64 112 140
0 1.02 1.00 0.99 0.67 0.61 0.61 0.61 0.74 0.73
1 0.95 0.99 1.00 0.68 0.61 0.60 0.60 0.74 0.74
2 1.00 1.04 1.00 0.68 0.59 0.61 0.65 0.76 0.76
4 1.00 0.96 0.98 0.70 0.63 0.63 0.67 0.78 0.77
8 1.01 1.02 0.89 0.73 0.65 0.67 0.71 0.81 0.80
16 0.99 0.96 0.96 0.79 0.71 0.73 0.80 0.84 0.84
32 0.99 0.95 1.05 0.89 0.84 0.85 0.94 0.92 0.91
64 1.00 0.99 1.16 1.04 1.00 1.02 1.06 0.99 0.99
128 1.00 1.06 0.98 1.14 1.39 1.26 1.08 1.02 0.98
There is regression in large critical section. But adaptive mutex is
aimed for "quick" locks. Small critical section is more common when
users choose to use adaptive pthread_mutex.
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|