x86: Improve large memset perf with non-temporal stores [RHEL-29312] - mirror/glibc - mirror of git://sourceware.org/git/glibc.git

author	Noah Goldstein <goldstein.w.n@gmail.com>	2024-05-24 12:38:50 -0500
committer	Noah Goldstein <goldstein.w.n@gmail.com>	2024-05-30 12:36:09 -0500
commit	5bf0ab80573d66e4ae5d94b094659094336da90f (patch)
tree	5221932dc8f91a0e79255f9224a33a0b6a309505 /elf/dl-tls.c
parent	53f9d74322c831c76bc6cf6ed8941267e8749604 (diff)

x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data on different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to rely only on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid fetching data in their RFO
requests, saving memory bandwidth.
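
As a rough illustration of the idea only (the actual change is in
hand-written assembly and is organized differently), a large memset
built on non-temporal stores might look like the C sketch below. The
name `memset_nt_sketch` and the 4 MiB cutoff `NT_THRESHOLD_SKETCH` are
hypothetical and are not glibc's symbols or tunables; the sketch
assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff chosen for illustration; not glibc's threshold.  */
#define NT_THRESHOLD_SKETCH ((size_t) 4 * 1024 * 1024)

/* Fill N bytes at DST with byte C.  Small sizes go to the regular
   memset; large sizes use non-temporal stores for the aligned body.  */
static void *
memset_nt_sketch (void *dst, int c, size_t n)
{
  unsigned char *p = dst;

  if (n < NT_THRESHOLD_SKETCH)
    return memset (dst, c, n);

  /* Head: ordinary stores up to the next 16-byte boundary.  */
  size_t head = (size_t) (-(uintptr_t) p & 15);
  memset (p, c, head);
  p += head;
  n -= head;
  assert (((uintptr_t) p & 15) == 0);

  /* Body: streaming stores, 64 bytes per iteration.  These are the
     writes that can skip fetching the old contents of the line.  */
  __m128i v = _mm_set1_epi8 ((char) c);
  for (; n >= 64; n -= 64, p += 64)
    {
      _mm_stream_si128 ((__m128i *) (p + 0), v);
      _mm_stream_si128 ((__m128i *) (p + 16), v);
      _mm_stream_si128 ((__m128i *) (p + 32), v);
      _mm_stream_si128 ((__m128i *) (p + 48), v);
    }

  /* Non-temporal stores are weakly ordered, so fence before returning.  */
  _mm_sfence ();

  /* Tail: fewer than 64 bytes remain; use ordinary stores.  */
  memset (p, c, n);
  return dst;
}
```

A caller would use it like memset, e.g.
`memset_nt_sketch (buf, 0, len)`. Sizes below the cutoff fall through
to the regular memset, so only large requests take the streaming path.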

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
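
The "roughly 7.5%" figure can be reproduced from the four memset-large
ratios above. The sketch below assumes the New / Old ratios are run
times (lower is better) and simply averages them; the helper name
`percent_time_saved` is hypothetical and is not part of the glibc
benchmark tooling.

```c
#include <stddef.h>

/* The four New / Old ratios reported for memset-large.  */
static const double memset_large_ratios[] = { 0.926, 0.925, 0.928, 0.924 };

/* Average the ratios and convert to percent of run time saved.
   Assumes each ratio is new_time / old_time.  */
static double
percent_time_saved (const double *ratios, size_t count)
{
  double sum = 0.0;
  for (size_t i = 0; i < count; i++)
    sum += ratios[i];
  return (1.0 - sum / (double) count) * 100.0;
}
```

Calling `percent_time_saved (memset_large_ratios, 4)` gives a value
just under 7.5.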

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>


