author | Noah Goldstein <goldstein.w.n@gmail.com> | 2022-04-15 12:28:00 -0500 |
---|---|---|
committer | Sunil K Pandey <skpgkp2@gmail.com> | 2022-05-16 20:46:04 -0700 |
commit | 95bbfc035140fe1ac85cafc8f0a3b85424c06897 (patch) | |
tree | 1c06f4e06edede5c6db697bc78afa32036eb5289 /sysdeps/x86_64/multiarch/Makefile | |
parent | aa4b53b4c041c522892a800b9e17a364a443c447 (diff) | |
x86: Remove memcmp-sse4.S
Code didn't actually use any sse4 instructions since `ptest` was removed in:

commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Nov 10 16:18:56 2021 -0600

x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size = 65.

Size = 1:

size | align0 | align1 | ret | New Time/Old Time |
---|---|---|---|---|
1 | 1 | 1 | 0 | 1.2 |
1 | 1 | 1 | 1 | 1.197 |
1 | 1 | 1 | -1 | 1.2 |

This is intentional. Size == 1 is significantly less hot based on profiles of GCC11 and Python3 than sizes [4, 8] (which are made hotter).

Workload | Size = 1 | Size = [4, 8] |
---|---|---|
Python3 | 13.64% | 60.92% |
GCC11 | 1.29% | 33.86% |

size | align0 | align1 | ret | New Time/Old Time |
---|---|---|---|---|
4 | 4 | 4 | 0 | 0.622 |
4 | 4 | 4 | 1 | 0.797 |
4 | 4 | 4 | -1 | 0.805 |
5 | 5 | 5 | 0 | 0.623 |
5 | 5 | 5 | 1 | 0.777 |
5 | 5 | 5 | -1 | 0.802 |
6 | 6 | 6 | 0 | 0.625 |
6 | 6 | 6 | 1 | 0.813 |
6 | 6 | 6 | -1 | 0.788 |
7 | 7 | 7 | 0 | 0.625 |
7 | 7 | 7 | 1 | 0.799 |
7 | 7 | 7 | -1 | 0.795 |
8 | 8 | 8 | 0 | 0.625 |
8 | 8 | 8 | 1 | 0.848 |
8 | 8 | 8 | -1 | 0.914 |
9 | 9 | 9 | 0 | 0.625 |

Size = 65:

size | align0 | align1 | ret | New Time/Old Time |
---|---|---|---|---|
65 | 0 | 0 | 0 | 1.103 |
65 | 0 | 0 | 1 | 1.216 |
65 | 0 | 0 | -1 | 1.227 |
65 | 65 | 0 | 0 | 1.091 |
65 | 0 | 65 | 1 | 1.19 |
65 | 65 | 65 | -1 | 1.215 |

This is because A) the checks in range [65, 96] are now unrolled 2x and B) smaller values <= 16 are now given a hotter path. By contrast the SSE4 version has a branch for Size = 80. The unrolled version gets better performance for returns which need both comparisons.

size | align0 | align1 | ret | New Time/Old Time |
---|---|---|---|---|
128 | 4 | 8 | 0 | 0.858 |
128 | 4 | 8 | 1 | 0.879 |
128 | 4 | 8 | -1 | 0.888 |

As well, outside of microbenchmark environments, where branches are not fully predictable, the branch will have a real cost.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)
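The commit message summarizes its benchmark with a geometric mean of New Time/Old Time ratios. As a rough illustration of that kind of summary, the sketch below computes a geometric mean over the ratios quoted in the message for Size = 1 and for sizes 4 through 9. The helper name `geometric_mean` and the grouping of rows are assumptions made for this example. It does not reproduce the 0.905 figure, which covers 20 page cross cases that are not listed in the message.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed through logarithms."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# New Time/Old Time values quoted in the commit message for Size = 1.
size_1 = [1.2, 1.197, 1.2]

# New Time/Old Time values quoted in the commit message for sizes 4 to 9.
size_4_to_9 = [
    0.622, 0.797, 0.805,
    0.623, 0.777, 0.802,
    0.625, 0.813, 0.788,
    0.625, 0.799, 0.795,
    0.625, 0.848, 0.914,
    0.625,
]

gm_size_1 = geometric_mean(size_1)
gm_small = geometric_mean(size_4_to_9)

# A value below 1.0 means the new SSE2 code is faster for that group.
print(f"Size = 1:    {gm_size_1:.3f}")
print(f"Sizes 4..9:  {gm_small:.3f}")
```

A geometric mean suits ratio data because it treats a 2x speedup and a 2x slowdown symmetrically, which an arithmetic mean of ratios does not.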
Diffstat (limited to 'sysdeps/x86_64/multiarch/Makefile')
-rw-r--r-- | sysdeps/x86_64/multiarch/Makefile | 2 |
1 files changed, 0 insertions, 2 deletions
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index bca82e38d8..b503e4b81e 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -11,7 +11,6 @@ sysdep_routines += \
   memcmp-avx2-movbe-rtm \
   memcmp-evex-movbe \
   memcmp-sse2 \
-  memcmp-sse4 \
   memcmp-ssse3 \
   memcpy-ssse3 \
   memcpy-ssse3-back \
@@ -174,7 +173,6 @@ sysdep_routines += \
   wmemcmp-avx2-movbe-rtm \
   wmemcmp-c \
   wmemcmp-evex-movbe \
-  wmemcmp-sse4 \
   wmemcmp-ssse3 \
 # sysdep_routines
 endif