| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
| |
Don't test a64fx string functions when BTI is enabled since they are
not BTI compatible.
|
|
|
|
|
|
|
|
|
|
| |
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.
[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
| |
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size. Since the code is identical to
__memset_falkor, remove the latter.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
| |
Cleanup emag memset - merge the memset_base64.S file, remove
the unused ZVA code (since it is disabled on emag).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
| |
Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename
strlen_mte to strlen_generic. Remove rtld-memset.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
|
|
| |
Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
| |
|
|
|
|
|
|
|
|
| |
Improve SVE memcpy by copying 2 vectors if the size is small enough.
This improves performance of random memcpy by ~9% on Neoverse V1, and
33-64 byte copies are ~16% faster.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
|
| |
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
| |
|
|
|
|
|
|
| |
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it. If SVE is available, a SVE memcpy will be
used by default (including for Neoverse N2).
|
|
|
|
|
| |
Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of
256, so add an explicit check.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).
This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC
redundant and fixes several targets that will write outside the array.
To avoid unnecessary large diffs, pass the maximum in the argument 'i' to
IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing
ones can be updated if desired.
Passes buildmanyglibc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
| |
Sort makefile entries to reduce conflicts.
|
|
|
|
|
|
| |
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
Cleanup the memcpy and memmove ifunc selectors.
|
|
|
|
|
| |
Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove.
This fixes BZ #28744.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
v2 is a complete rewrite of the A64FX memcpy. Performance is improved
by streamlining the code, aligning all large copies and using a single
unrolled loop for all sizes. The code size for memcpy and memmove goes
down from 1796 bytes to 868 bytes. Performance is better in all cases:
bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33%
faster for large sizes, bench-memcpy-walk is 25% faster for small sizes
and 20% for the largest sizes. The geomean of all tests in bench-memcpy
is 5.1% faster, and total time is reduced by 4%.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
|
|
|
| |
This patch disables A64FX memcpy/memmove BTI instruction insertion
unconditionally such as A64FX memset patch [1] for performance.
[1] commit 07b427296b8d59f439144029d9a948f6c1ce0a31
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
|
|
|
|
| |
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.
Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
|
|
|
|
| |
Because of wrong commit author. Will recommit it with right author.
This reverts commit 23777232c23f80809613bdfa329f63aadf992922.
|
|
|
|
|
|
|
|
|
|
| |
This patch updates unroll8 code so as not to degrade at the peak
performance 16KB for both FX1000 and FX700.
Inserted 2 instructions at the beginning of the unroll8 loop,
cmp and branch, are a workaround that is found heuristically.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
|
|
|
|
|
| |
Simplify the code for memsets smaller than L1. Improve the unroll8 and
L1_prefetch loops.
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
|
|
|
|
|
|
| |
Remove unroll32 code since it doesn't improve performance.
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
|
|
|
|
|
|
|
| |
Simplify handling of remaining bytes. Avoid lots of taken branches and complex
whilelo computations, instead unconditionally write vectors from the end.
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
|
|
|
|
|
|
|
|
| |
Improve performance of large memsets. Simplify alignment code. For zero memset
use DC ZVA, which almost doubles performance. For non-zero memsets use the
unroll8 loop which is about 10% faster.
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
|
|
|
|
|
|
|
|
| |
Improve performance of small memsets by reducing instruction counts and
improving code alignment. Bench-memset shows 35-45% performance gain for
small sizes.
Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.
The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.
SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.
We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.
And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.
[1] https://github.com/fujitsu/A64FX
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.
The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.
SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.
We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.
And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.
[1] https://github.com/fujitsu/A64FX
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some IFUNC variants are not compatible with BTI and MTE so don't
set them as usable for testing and benchmarking on a BTI or MTE
enabled system.
As far as IFUNC selectors are concerned a system is BTI enabled if
the cpu supports it and glibc was built with BTI branch protection.
Most IFUNC variants are BTI compatible, but thunderx2 memcpy and
memmove use a jump table with indirect jump, without a BTI j.
Fixes bug 26818.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The hwcap value is now in linux 5.10 and in glibc bits/hwcap.h, so use
that definition.
Move the definition to init-arch.h so all ifunc selectors can use it
and expose an "mte" shorthand for mte enabled runtime.
For now we allow user code to enable tag checks and use PROT_MTE
mappings without libc involvment, this is not guaranteed ABI, but
can be useful for testing and debugging with MTE.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version,
however, there is performance degradation when the copy size is some large bytes, eg: 100k.
This is the result, tested in glibc-2.28:
before backport after backport Performance improvement
memcpy_1k 0.005 0.005 0.00%
memcpy_10k 0.032 0.029 10.34%
memcpy_100k 0.356 0.429 -17.02%
memcpy_1m 7.470 11.153 -33.02%
This is the demo
#include "stdio.h"
#include "string.h"
#include "stdlib.h"
char a[1024*1024] = {12};
char b[1024*1024] = {13};
int main(int argc, char *argv[])
{
int i = atoi(argv[1]);
int j;
int size = atoi(argv[2]);
for (j = 0; j < i; j++)
memcpy(b, a, size*1024);
return 0;
}
# gcc -g -O0 memcpy.c -o memcpy
# time taskset -c 10 ./memcpy 100000 1024
Co-authored-by: liqingqing <liqingqing3@huawei.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DELOUSE was added to asm code to make them compatible with non-LP64
ABIs, but it is an unfortunate name and the code was not compatible
with ABIs where pointer and size_t are different. Glibc currently
only supports the LP64 ABI so these macros are not really needed or
tested, but for now the name is changed to be more meaningful instead
of removing them completely.
Some DELOUSE macros were dropped: clone, strlen and strnlen used it
unnecessarily.
The out of tree ILP32 patches are currently not maintained and will
likely need a rework to rebase them on top of the time64 changes.
|
|
|
|
|
|
| |
This symbol is not in the implementation reserved namespace for static
linking and it was never used: it seems it was mistakenly added in the
orignal strlen_asimd commit 436e4d5b965abe592d26150cb518accf9ded8fe4
|
|
|
|
|
|
|
| |
Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
the memcpy/memmove ifunc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
| |
On some microarchitectures performance of the backwards memmove improves if
the stores use STR with decreasing addresses. So change the memmove loop
in memcpy_advsimd.S to use 2x STR rather than STP.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Make glibc MTE-safe on systems where MTE is available. This allows
using heap tagging with an LD_PRELOADed malloc implementation that
enables MTE. We don't document this as guaranteed contract yet, so
glibc may not be MTE safe when HWCAP2_MTE is set (older glibcs
certainly aren't). This is mainly for testing and debugging.
The HWCAP flag is not exposed in public headers until Linux adds it
to its uapi. The HWCAP value reservation will be in Linux 5.9.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Optimize strlen using a mix of scalar and SIMD code. On modern micro
architectures large strings are 2.6 times faster than existing
strlen_asimd and 35% faster than the new MTE version of strlen.
On a random strlen benchmark using small sizes the speedup is 7% vs
strlen_asimd and 40% vs the MTE strlen. This fixes the main strlen
regressions on Cortex-A53 and other cores with a simple Neon unit.
Rename __strlen_generic to __strlen_mte, and select strlen_asimd when
MTE is not enabled (this is waiting on support for a HWCAP_MTE bit).
This fixes big-endian bug 25824. Passes GLIBC regression tests.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
| |
Rename IS_ARES to IS_NEOVERSE_N1 since that is a bit clearer.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize. Similar to the generic memcpy, small cases
include copies up to 32 bytes. 64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies. Large copies align
the source rather than the destination.
bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).
Passes GLIBC regression tests.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To enable building glibc with branch protection, assembly code
needs BTI landing pads and ELF object file markings in the form
of a GNU property note.
The landing pads are unconditionally added to all functions that
may be indirectly called. When the code segment is not mapped
with PROT_BTI these instructions are nops. They are kept in the
code when BTI is not supported so that the layout of performance
critical code is unchanged across configurations.
The GNU property notes are only added when there is support for
BTI in the toolchain, because old binutils does not handle the
notes right. (Does not know how to merge them nor to put them in
PT_GNU_PROPERTY segment instead of PT_NOTE, and some versions
of binutils emit warnings about the unknown GNU property. In
such cases the produced libc binaries would not have valid
ELF marking so BTI would not be enabled.)
Note: functions using ENTRY or ENTRY_ALIGN now start with an
additional BTI c, so alignment of the following code changes,
but ENTRY_ALIGN_AND_PAD was fixed so there is no change to the
existing code layout. Some string functions may need to be
tuned for optimal performance after this commit.
Co-authored-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Falkor's memcpy and memmove share some implementation details,
therefore, the two routines are moved to a single source file
for code reuse.
The two routines now share code for small and medium copies
(up to and including 128 bytes). Large copies in memcpy do not
handle overlap correctly, consequently, the loops for
moving/copying more than 128 bytes stay separate for memcpy
and memmove.
To increase code reuse a number of small modifications were made:
1. The old implementation of memcpy copied the first 16-bytes as
soon as the size of data was determined to be greater than 32 bytes.
For memcpy code to also work when copying small/medium overlapping
data, the first load and store was moved to the large copy case.
2. Medium memcpy case no longer assumes that 16 bytes were already
copied and uses 8 registers to copy up to 128 bytes.
3. Small case for memmove was enlarged to that of memcpy, which is
less than or equal to 32 bytes.
4. Medium case for memmove was enlarged to that of memcpy, which is
less than or equal to 128 bytes.
Other changes include:
1. Improve alignment of existing loop bodies.
2. 'Delouse' memmove and memcpy input arguments. Make sure that
upper 32-bits of input registers are zeroed if unused.
3. Do one more iteration in memmove loops and reduce the number of
copies made from the start/end of the buffer, depending on
the direction of the memmove loop.
Benchmarking:
Looking at the results from bench-memcpy-random.out, we can see that
now memmove_falkor is about 5% faster than memcpy_falkor_old, while
memmove_falkor_old was more than 15% slower. The memcpy implementation
remained largely unmodified, so there is no significant performance
change.
The reason for such a significant memmove performance gain is the
increase of the upper bound on the small copy case to 32 bytes and
the increase of the upper bound on the medium copy case to 128 bytes.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
| |
|
|
|
|
| |
Checked on aarch64-linux-gnu.
|
|
|
|
|
|
|
| |
Rename ifunc for kunpeng to kunpeng920, and modify the corresponding
function files including IS_KUNPENG920 judgement.
Checked on aarch64-linux-gnu.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to the branch prediction issue of Kunpeng processor, we found
memset_generic has poor performance on middle sizes setting, and so
we reconstructed the logic, expanded the loop by 4 times in set_long
to solve the problem, even when setting below 1K sizes have benefit.
Another change is that DZ_ZVA seems no work when setting zero, so we
discarded it and used set_long to set zero instead. Fewer branches and
predictions also make the zero case have slightly improvement.
Checked on aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Optimize the strlen implementation by using vector operations and
loop unrolling in main loop.Compared to __strlen_generic,it reduces
latency of cases in bench-strlen by 7%~18% when the length of src
is greater than 128 bytes, with gains throughout the benchmark.
Checked on aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|