| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
Replace boolean CAS with value CAS to avoid the extra load.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f)
|
|
|
|
|
|
|
|
|
|
|
|
| |
libc_map is never reset to NULL, neither during dlclose nor on a
dlopen call which reuses the namespace structure. As a result, if a
namespace is reused, its libc is not initialized properly. The most
visible result is a crash in the <ctype.h> functions.
To prevent similar bugs on namespace reuse from surfacing,
unconditionally initialize the chosen namespace to zero using memset.
(cherry picked from commit d0e357ff45a75553dee3b17ed7d303bfa544f6fe)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Was missing to for the multiarch build rtld-strncmp-sse4_2.os was
being built and exporting symbols:
build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42
Introduced in:
commit 11ffcacb64a939c10cfc713746b8ec88837f5c4a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 21 12:10:50 2017 -0700
x86-64: Implement strcmp family IFUNC selectors in C
(cherry picked from commit 96ac447d915ea5ecef3f9168cc13f4e731349a3b)
|
|
|
|
|
|
|
|
| |
The primary memmove_{impl}_unaligned_erms implementations don't
interact with this function. Putting them in same file both
wastes space and unnecessarily bloats a hot code section.
(cherry picked from commit 21925f64730d52eb7d8b2fb62b412f8ab92b0caf)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implementation wise:
1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
use the L(stosb) label that was previously defined.
2. Don't give the hotpath (fallthrough) to zero size.
Code positioning wise:
Move memset_{chk}_erms to its own file. Leaving it in between the
memset_{impl}_unaligned both adds unnecessary complexity to the
file and wastes space in a relatively hot cache section.
(cherry picked from commit 4a3f29e7e475dd4e7cce2a24c187e6fb7b5b0a05)
|
|
|
|
|
|
| |
This was simply missing and meant we weren't testing it properly.
(cherry picked from commit 2a1099020cdc1e4c9c928156aa85c8cf9d540291)
|
|
|
|
|
|
|
| |
Previously was missing but the two implementations shouldn't get in
the sse2 (generic) text section.
(cherry picked from commit afc6e4328ff80973bde50d5401691b4c4b2e522c)
|
|
|
|
|
|
|
|
|
|
|
| |
The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.
As well different code boths where explicitly written to touch the
minimum number of cache line i.e sizes <= 32 touch only the entry
cache line.
(cherry picked from commit 227afaa67213efcdce6a870ef5086200f1076438)
|
|
|
|
|
|
|
|
|
| |
BMI1/BMI2 are part of the ISA V3 requirements:
https://en.wikipedia.org/wiki/X86-64
And defined by GCC when building with `-march=x86-64-v3`
(cherry picked from commit 8da9f346cb2051844348785b8a932ec44489e0b7)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
(cherry picked from commit 89a25c6f64746732b87eaf433af0964b564d4a92)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memset_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
right-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
(cherry picked from commit b446822b6ae4e8149902a78cdd4a886634ad6321)
|
|
|
|
|
|
|
|
|
|
|
| |
This has been missing since the the ifuncs where added.
The performance of SSE4.2 is preferable to to SSE2.
Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
(cherry picked from commit ff439c47173565fbff4f0f78d07b0f14e4a7db05)
|
|
|
|
|
|
|
|
|
|
| |
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold)
(cherry picked from commit 035591551400cfc810b07244a015c9411e8bff7c)
|
|
|
|
|
|
| |
This ensures the load will never split a cache line.
(cherry picked from commit 0f91811333f23b61cf681cab2704b35a0a073b97)
|
|
|
|
|
|
|
|
|
|
| |
Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.
Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c28db9cb29a7d6cf3ce08fd8445e6b7dea03f35b)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 64 bytes
There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 56da3fe1dd075285fa8186d44b3c28e68c687e62)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.
The total code size saving is: 59 bytes
There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35)
x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]
commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jun 6 21:11:33 2022 -0700
x86: Shrink code size of memchr-avx2.S
Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data start started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.
This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.
Test cases added and all pass with the new code (and where confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2c9af8421d2b4a7fcce163e7bc81a118d22fd346)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit af5306a735eb0966fdc2f8ccdafa8888e2df0c87)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller user-arg lengths more.
2. optimizes target placement more carefully
3. reuses logic more
4. fixes up various inefficiencies in the logic. The biggest
case here is the `lzcnt` logic for checking returns which
saves either a branch or multiple instructions.
The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755
Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.
This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b4209615a06b01c974f47b4998b00e4c7b1aa5d9)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code:
1. prioritizes smaller lengths more.
2. optimizes target placement more carefully.
3. reuses logic more.
4. fixes up various inefficiencies in the logic.
The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874
Regressions:
1. The page cross case is now colder, especially re-entry from the
page cross case if a match is not found in the first VEC
(roughly 50%). My general opinion with this patch is this is
acceptable given the "coldness" of this case (less than 4%) and
generally performance improvement in the other far more common
cases.
2. There are some regressions 5-15% for medium/large user-arg
lengths that have a match in the first VEC. This is because the
logic was rewritten to optimize finds in the first VEC if the
user-arg length is shorter (where we see roughly 20-50%
performance improvements). It is not always the case this is a
regression. My intuition is some frontend quirk is partially
explaining the data although I haven't been able to find the
root cause.
Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 731feee3869550e93177e604604c1765d81de571)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.
This can be useful when hoisting a vzeroupper to save code size
for example:
```
L(foo):
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
VZEROUPPER_RETURN
L(bar):
xorl %eax, %eax
VZEROUPPER_RETURN
```
Can become:
```
L(foo):
COND_VZEROUPPER
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
ret
L(bar):
xorl %eax, %eax
ret
```
This code does not change any existing functionality.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit dd5c483b2598f411428df4d8864c15c4b8a3cd68)
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.
There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8a780a6b910023e71f3173f37f0793834c047554)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:
(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.
(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.
(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.
Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)
Geometric mean of all benchmarks: new / old = 0.66
Difficult skiptable(0) : new / old = 0.02
Difficult skiptable(1) : new / old = 0.01
Difficult 2-way : new / old = 0.25
Difficult testing first 2 : new / old = 1.26
Difficult skiptable(0) : new / old = 0.05
Difficult skiptable(1) : new / old = 0.06
Difficult 2-way : new / old = 0.26
Difficult testing first 2 : new / old = 1.05
Difficult skiptable(0) : new / old = 0.42
Difficult skiptable(1) : new / old = 0.24
Difficult 2-way : new / old = 0.21
Difficult testing first 2 : new / old = 1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5082a287d5e9a1f9cb98b7c982a708a3684f1d5c)
x86: Remove __mmask intrinsics in strstr-avx512.c
The intrinsics are not available before GCC7 and using standard
operators generates code of equivalent or better quality.
Removed:
_cvtmask64_u64
_kshiftri_mask64
_kand_mask64
Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958
(cherry picked from commit f2698954ff9c2f9626d4bcb5a30eb5729714e0b0)
|
|
|
|
|
|
|
|
|
| |
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.
Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit f8587a61892cbafd98ce599131bf4f103466f084)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.
Placeholder function, not used by any processor at the moment.
- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 9c66efb86fe384f77435f7e326333fb2e4e10676)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.
GCC also replaces bcopy with a memmove and bzero with memset on default
configuration (to actually get a bzero libc call the code requires
to omit string.h inclusion and built with -fno-builtin), so it is
highly unlikely programs are actually calling libc bzero symbol.
On a recent Linux distro (Ubuntu 22.04), there is no bzero calls
by the installed binaries.
$ cat count_bstring.sh
#!/bin/bash
files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
total=0
for file in $files; do
symbols=`objdump -R $file 2>&1`
if [ $? -eq 0 ]; then
ncalls=`echo $symbols | grep -w $1 | wc -l`
((total=total+ncalls))
if [ $ncalls -gt 0 ]; then
echo "$file: $ncalls"
fi
fi
done
echo "TOTAL=$total"
$ ./count_bstring.sh bzero
TOTAL=0
Checked on x86_64-linux-gnu.
(cherry picked from commit 9403b71ae97e3f1a91c796ddcbb4e6f044434734)
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit remove trailing space introduced by following commit.
commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:56:29 2021 -0400
x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]
(cherry picked from commit 8d324019e69203f5998f223d0e905de1395330ea)
|
|
|
|
|
|
|
|
|
|
| |
Separated debuginfo files have PT_DYNAMIC with p_filesz == 0. We
need to check for that before the _dl_map_segments call because
that could attempt to write to mappings that extend beyond the end
of the file, resulting in SIGBUS.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit ea32ec354c65ddad11b82ca9d057010df13a9cea)
|
|
|
|
|
|
|
|
|
|
| |
On success, mq_receive() and mq_timedreceive() return the number of
bytes in the received message, so it requires to check if the value
is larger than 0.
Checked on i686-linux-gnu.
(cherry picked from commit 71d87d85bf54f6522813aec97c19bdd24997341e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.
Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).
Tested on powerpc, powerpc64 and powerpc64le.
Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
(cherry picked from commit 0218463dd8265ed937622f88ac68c7d984fe0cfc)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Re-cherry-pick commit c627209832 for strcmp-avx2.S change which was
omitted in intial cherry pick because at the time this bug was not
present on release branch.
Fixes BZ #29127.
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c966099cdc3e0fdf92f63eac09b22fa7e5f5f02d)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit df7e295d18ffa34f629578c0017a9881af7620f6)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.
Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.
Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5307aa9c1800f36a64c183c091c9af392c1fa75c)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.
geometric_mean(N=20) of page cross cases New / Original: 0.960
size, align0, align1, ret, New Time/Old Time
1, 4095, 0, 0, 1.001
1, 4095, 0, 1, 0.999
1, 4095, 0, -1, 1.0
2, 4094, 0, 0, 1.0
2, 4094, 0, 1, 1.0
2, 4094, 0, -1, 1.0
3, 4093, 0, 0, 1.0
3, 4093, 0, 1, 1.0
3, 4093, 0, -1, 1.0
4, 4092, 0, 0, 0.987
4, 4092, 0, 1, 1.0
4, 4092, 0, -1, 1.0
5, 4091, 0, 0, 0.984
5, 4091, 0, 1, 1.002
5, 4091, 0, -1, 1.005
6, 4090, 0, 0, 0.993
6, 4090, 0, 1, 1.001
6, 4090, 0, -1, 1.003
7, 4089, 0, 0, 0.991
7, 4089, 0, 1, 1.0
7, 4089, 0, -1, 1.001
8, 4088, 0, 0, 0.875
8, 4088, 0, 1, 0.881
8, 4088, 0, -1, 0.888
9, 4087, 0, 0, 0.872
9, 4087, 0, 1, 0.879
9, 4087, 0, -1, 0.883
10, 4086, 0, 0, 0.878
10, 4086, 0, 1, 0.886
10, 4086, 0, -1, 0.873
11, 4085, 0, 0, 0.878
11, 4085, 0, 1, 0.881
11, 4085, 0, -1, 0.879
12, 4084, 0, 0, 0.873
12, 4084, 0, 1, 0.889
12, 4084, 0, -1, 0.875
13, 4083, 0, 0, 0.873
13, 4083, 0, 1, 0.863
13, 4083, 0, -1, 0.863
14, 4082, 0, 0, 0.838
14, 4082, 0, 1, 0.869
14, 4082, 0, -1, 0.877
15, 4081, 0, 0, 0.841
15, 4081, 0, 1, 0.869
15, 4081, 0, -1, 0.876
16, 4080, 0, 0, 0.988
16, 4080, 0, 1, 0.99
16, 4080, 0, -1, 0.989
17, 4079, 0, 0, 0.978
17, 4079, 0, 1, 0.981
17, 4079, 0, -1, 0.98
18, 4078, 0, 0, 0.981
18, 4078, 0, 1, 0.98
18, 4078, 0, -1, 0.985
19, 4077, 0, 0, 0.977
19, 4077, 0, 1, 0.979
19, 4077, 0, -1, 0.986
20, 4076, 0, 0, 0.977
20, 4076, 0, 1, 0.986
20, 4076, 0, -1, 0.984
21, 4075, 0, 0, 0.977
21, 4075, 0, 1, 0.983
21, 4075, 0, -1, 0.988
22, 4074, 0, 0, 0.983
22, 4074, 0, 1, 0.994
22, 4074, 0, -1, 0.993
23, 4073, 0, 0, 0.98
23, 4073, 0, 1, 0.992
23, 4073, 0, -1, 0.995
24, 4072, 0, 0, 0.989
24, 4072, 0, 1, 0.989
24, 4072, 0, -1, 0.991
25, 4071, 0, 0, 0.99
25, 4071, 0, 1, 0.999
25, 4071, 0, -1, 0.996
26, 4070, 0, 0, 0.993
26, 4070, 0, 1, 0.995
26, 4070, 0, -1, 0.998
27, 4069, 0, 0, 0.993
27, 4069, 0, 1, 0.999
27, 4069, 0, -1, 1.0
28, 4068, 0, 0, 0.997
28, 4068, 0, 1, 1.0
28, 4068, 0, -1, 0.999
29, 4067, 0, 0, 0.996
29, 4067, 0, 1, 0.999
29, 4067, 0, -1, 0.999
30, 4066, 0, 0, 0.991
30, 4066, 0, 1, 1.001
30, 4066, 0, -1, 0.999
31, 4065, 0, 0, 0.988
31, 4065, 0, 1, 0.998
31, 4065, 0, -1, 0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 23102686ec67b856a2d4fd25ddaa1c0b8d175c4f)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Code didn't actually use any sse4 instructions since `ptest` was
removed in:
commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Nov 10 16:18:56 2021 -0600
x86: Shrink memcmp-sse4.S code size
The new memcmp-sse2 implementation is also faster.
geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905
Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.
Size = 1:
size, align0, align1, ret, New Time/Old Time
1, 1, 1, 0, 1.2
1, 1, 1, 1, 1.197
1, 1, 1, -1, 1.2
This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).
Python3 Size = 1 -> 13.64%
Python3 Size = [4, 8] -> 60.92%
GCC11 Size = 1 -> 1.29%
GCC11 Size = [4, 8] -> 33.86%
size, align0, align1, ret, New Time/Old Time
4, 4, 4, 0, 0.622
4, 4, 4, 1, 0.797
4, 4, 4, -1, 0.805
5, 5, 5, 0, 0.623
5, 5, 5, 1, 0.777
5, 5, 5, -1, 0.802
6, 6, 6, 0, 0.625
6, 6, 6, 1, 0.813
6, 6, 6, -1, 0.788
7, 7, 7, 0, 0.625
7, 7, 7, 1, 0.799
7, 7, 7, -1, 0.795
8, 8, 8, 0, 0.625
8, 8, 8, 1, 0.848
8, 8, 8, -1, 0.914
9, 9, 9, 0, 0.625
Size = 65:
size, align0, align1, ret, New Time/Old Time
65, 0, 0, 0, 1.103
65, 0, 0, 1, 1.216
65, 0, 0, -1, 1.227
65, 65, 0, 0, 1.091
65, 0, 65, 1, 1.19
65, 65, 65, -1, 1.215
This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.
size, align0, align1, ret, New Time/Old Time
128, 4, 8, 0, 0.858
128, 4, 8, 1, 0.879
128, 4, 8, -1, 0.888
As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Just a few QOL changes.
1. Prefer `add` > `lea` as it has high execution units it can run
on.
2. Don't break macro-fusion between `test` and `jcc`
3. Reduce code size by removing gratuitous padding bytes (-90
bytes).
geometric_mean(N=20) of all benchmarks New / Original: 0.959
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 244b415d386487521882debb845a040a4758cb18)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The rational is:
1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
regression on Tigerlake using SSE42 versus AVX across the
benchtest suite).
2. AVX2 version covers the majority of targets that previously
prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
becoming outdated.
All in all the saving the code size is worth it.
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 305769b2a15c2e96f9e1b5195d3c4e0d6f0f4b68)
|
|
|
|
|
|
|
|
|
| |
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 84e7c46df4086873eae28a1fb87d2cf5388b1e16)
|
|
|
|
|
|
|
|
|
| |
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bbf81222343fed5cd704001a2ae0d86c71544151)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Slightly faster method of doing TOLOWER that saves an
instruction.
Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.
geometric_mean(N=40) of all benchmarks New / Original: .920
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit d154758e618ec9324f5d339c46db0aa27e8b1226)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Slightly faster method of doing TOLOWER that saves an
instruction.
Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.
geometric_mean(N=40) of all benchmarks New / Original: .894
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 670b54bc585ea4a94f3b2e9272ba44aa6b730b73)
|
|
|
|
|
|
|
|
|
|
|
| |
The generic implementation is faster.
geometric_mean(N=20) of all benchmarks New / Original: .710
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9c8a6ad620b49a27120ecdd7049c26bf05900397)
|
|
|
|
|
|
|
|
|
| |
The generic implementation is faster (see strcspn commit).
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 653358535280a599382cb6c77538a187dac6a87f)
|
|
|
|
|
|
|
|
|
|
|
| |
The generic implementation is faster.
geometric_mean(N=20) of all benchmarks New / Original: .678
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit fe28e7d9d9535ebab4081d195c553b4fbf39d9ae)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
geometric_mean(N=20) of all benchmarks that dont fallback on
sse2; New / Original: .901
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 412d10343168b05b8cf6c3683457cf9711d28046)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.
geometric_mean(N=20) of all benchmarks that dont fallback on
sse2/strlen; New / Original: .928
All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 30d627d477d7255345a4b713cf352ac32d644d61)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small code cleanup for size: -81 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks New / Original: .985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit ec285ea90415458225623ddc0492ae3f705af043)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small code cleanup for size: -53 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6fbf4d51e9ba8063c4f8331564892ead9c67344)
|
|
|
|
|
|
|
| |
The symbols is not present in current POSIX specification and compiler
already generates memmove call.
(cherry picked from commit bf92893a14ebc161b08b28acc24fa06ae6be19cb)
|