aarch64,falkor: Ignore prefetcher tagging for smaller copies - mirror/glibc - mirror of git://sourceware.org/git/glibc.git

author	Siddhesh Poyarekar <siddhesh@sourceware.org>	2018-05-11 00:11:52 +0530
committer	Siddhesh Poyarekar <siddhesh@sourceware.org>	2018-05-11 00:11:52 +0530
commit	db725a458e1cb0e17204daa543744faf08bb2e06 (patch)
tree	fc19e9be431ff0128b7bdd6ea3f46609ec0cf303 /include
parent	70c97f8493ab2a215c2543d78f212abb23f151ed (diff)

aarch64,falkor: Ignore prefetcher tagging for smaller copies

For small and medium-sized copies, the effect of hardware
prefetching is not as dominant as instruction-level parallelism.
Hence it makes more sense to load data into multiple registers than to
try to route the loads to the same prefetch unit.  This is also the case
for the loop exit, where we are unable to latch on to the same prefetch
unit anyway, so it makes more sense to have data loaded in parallel.
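
As a rough illustration only (plain C rather than the actual Falkor
assembly; the function name copy_tail_64, the chunk16 type and the fixed
64-byte size are hypothetical and not taken from the patch), loading a
tail into several independent temporaries before storing might look
like this:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not the code in memcpy_falkor.S: copy a fixed
   64-byte tail through four independent 16-byte temporaries.  All
   loads are written before any store and none depends on another,
   which leaves the core free to overlap them.  */
typedef struct { uint64_t lo, hi; } chunk16;

static void
copy_tail_64 (unsigned char *dst, const unsigned char *src)
{
  chunk16 a, b, c, d;

  /* Four independent loads into four distinct temporaries.  */
  memcpy (&a, src, 16);
  memcpy (&b, src + 16, 16);
  memcpy (&c, src + 32, 16);
  memcpy (&d, src + 48, 16);

  /* Stores follow once all four loads have been issued.  */
  memcpy (dst, &a, 16);
  memcpy (dst + 16, &b, 16);
  memcpy (dst + 32, &c, 16);
  memcpy (dst + 48, &d, 16);
}
```

Whether a compiler turns this into paired loads and stores depends on
the target and optimization level; the sketch only shows the shape of
the dependency pattern, not the instruction sequence used in the patch.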

The performance results are a bit mixed with memcpy-random, with
numbers jumping between -1% and +3%, i.e. the numbers don't seem
repeatable.  memcpy-walk sees a 70% improvement (i.e. > 2x) for 128
bytes, and that improvement diminishes as the impact of the tail copy
decreases in comparison to the loop.

	* sysdeps/aarch64/multiarch/memcpy_falkor.S (__memcpy_falkor):
	Use multiple registers to copy data in loop tail.
