Commit message · Author · Age · Files · Lines
* Force DT_RPATH for --enable-hardcoded-path-in-tests [release/2.33/master] · H.J. Lu · 2024-05-10 · 1 file · -2/+5

On Fedora 40/x86-64, the linker enables --enable-new-dtags by default, which generates DT_RUNPATH instead of DT_RPATH. Unlike DT_RPATH, DT_RUNPATH applies only to DT_NEEDED entries in the executable and does not apply to DT_NEEDED entries in shared libraries which are loaded via DT_NEEDED entries in the executable. Some glibc tests have libstdc++.so.6 in DT_NEEDED, which has libm.so.6 in DT_NEEDED. When DT_RUNPATH is generated, /lib64/libm.so.6 is loaded for such tests. If the newly built glibc is older than glibc 2.36, these tests fail with:

assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_2.36' not found (required by /lib64/libm.so.6)
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /lib64/libm.so.6)

Pass -Wl,--disable-new-dtags to the linker when building glibc tests with --enable-hardcoded-path-in-tests. This fixes BZ #31719.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2dcaf70643710e22f92a351e36e3cff8b48c60dc)
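A minimal C sketch of the DT_RPATH/DT_RUNPATH distinction (standard <link.h>; reports which tag the running executable carries):

```
#include <elf.h>
#include <link.h>
#include <stdio.h>

extern ElfW(Dyn) _DYNAMIC[];

int
main (void)
{
  /* DT_RPATH seeds the search path for the whole dependency tree;
     DT_RUNPATH covers only this object's direct DT_NEEDED entries,
     so a DT_NEEDED of a DT_NEEDED falls back to the system path.  */
  for (ElfW(Dyn) *d = _DYNAMIC; d->d_tag != DT_NULL; d++)
    if (d->d_tag == DT_RPATH)
      puts ("DT_RPATH: also searched for indirect dependencies");
    else if (d->d_tag == DT_RUNPATH)
      puts ("DT_RUNPATH: searched for direct dependencies only");
  return 0;
}
```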
* elf: Disable some subtests of ifuncmain1, ifuncmain5 for !PIE · Florian Weimer · 2024-05-09 · 2 files · -0/+22

(cherry picked from commit 9cc9d61ee12f2f8620d8e0ea3c42af02bf07fe1e)
* nscd: Use time_t for return type of addgetnetgrentX · Florian Weimer · 2024-05-03 · 1 file · -2/+2

Using int may give false results for future dates (timeouts after the year 2028).

Fixes commit c04a21e050d64a1193a6daab872bca2528bda44b ("CVE-2024-33601, CVE-2024-33602: nscd: netgroup: Use two buffers in addgetnetgrentX (bug 31680)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4bbca1a44691a6e9adcee5c6798a707b626bc331)
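Why the narrow return type matters, in a short C sketch (assuming 64-bit time_t; names illustrative):

```
#include <stdio.h>
#include <time.h>

/* Narrowing a 64-bit time_t expiry to int truncates values past
   INT_MAX (2038-01-19 UTC), so a far-future timeout can come back as
   a bogus, even negative, time.  */
static int
timeout_as_int (time_t expiry)
{
  return expiry;                /* implicit narrowing conversion */
}

int
main (void)
{
  time_t far_future = (time_t) 1 << 33;          /* ~year 2242 */
  printf ("time_t %lld -> int %d\n",
          (long long) far_future, timeout_as_int (far_future));
  return 0;
}
```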
* CVE-2024-33601, CVE-2024-33602: nscd: netgroup: Use two buffers in addgetnetgrentX (bug 31680) · Florian Weimer · 2024-04-25 · 1 file · -98/+121

This avoids potential memory corruption when the underlying NSS callback function does not use the buffer space to store all strings (e.g., for constant strings).

Instead of custom buffer management, two scratch buffers are used. This increases stack usage somewhat.

Scratch buffer allocation failure is handled by returning -1 (an invalid timeout value) instead of terminating the process. This fixes bug 31679.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit c04a21e050d64a1193a6daab872bca2528bda44b)
* CVE-2024-33600: nscd: Avoid null pointer crashes after notfound response (bug 31678) · Florian Weimer · 2024-04-25 · 1 file · -4/+7

The addgetnetgrentX call in addinnetgrX may have failed to produce a result, so the result variable in addinnetgrX can be NULL. Use db->negtimeout as the fallback value if there is no result data; the timeout is also overwritten below.

Also avoid sending a second not-found response. (The client disconnects after receiving the first response, so the data stream did not go out of sync even without this fix.) It is still beneficial to add the negative response to the mapping, so that the client can get it from there in the future, instead of going through the socket.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit b048a482f088e53144d26a61c390bed0210f49f2)
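A hedged sketch of that fallback pattern (field names illustrative, not the actual nscd structures):

```
#include <stddef.h>
#include <time.h>

struct dataset { time_t timeout; /* ... */ };
struct database { time_t negtimeout; /* ... */ };

static time_t
pick_timeout (const struct dataset *result, const struct database *db)
{
  /* result may be NULL when the lookup produced nothing; fall back to
     the database's negative-entry timeout instead of dereferencing.  */
  return result != NULL ? result->timeout : db->negtimeout;
}
```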
* CVE-2024-33600: nscd: Do not send missing not-found response in addgetnetgrentX (bug 31678) · Florian Weimer · 2024-04-25 · 1 file · -8/+6

If we failed to add a not-found response to the cache, the dataset pointer can be null, resulting in a null pointer dereference.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 7835b00dbce53c3c87bbbb1754a95fb5e58187aa)
* CVE-2024-33599: nscd: Stack-based buffer overflow in netgroup cache (bug 31677) · Florian Weimer · 2024-04-25 · 1 file · -2/+3

Using alloca matches what other caches do. The request length is bounded by MAXKEYLEN.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 87801a8fd06db1d654eea3e4f7626ff476a9bdaa)
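The safe-alloca pattern in a standalone C sketch (MAXKEYLEN value and names illustrative):

```
#include <alloca.h>
#include <string.h>

#define MAXKEYLEN 1024          /* illustrative bound */

static int
handle_request (const char *key, size_t key_len)
{
  /* alloca is only safe when the size is provably small: reject
     oversized requests before touching the stack.  */
  if (key_len > MAXKEYLEN)
    return -1;
  char *copy = alloca (key_len + 1);
  memcpy (copy, key, key_len);
  copy[key_len] = '\0';
  /* ... process copy ... */
  return 0;
}
```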
* iconv: ISO-2022-CN-EXT: fix out-of-bound writes when writing escape sequence (CVE-2024-2961) · Charles Fol · 2024-04-17 · 3 files · -1/+144

ISO-2022-CN-EXT uses escape sequences to indicate character set changes (as specified by RFC 1922). While the SOdesignation has the expected bounds checks, neither SS2designation nor SS3designation has one, allowing a write overflow of 1, 2, or 3 bytes with fixed values: '$+I', '$+J', '$+K', '$+L', '$+M', or '$*H'.

Checked on aarch64-linux-gnu.

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f9dc609e06b1136bb0408be9605ce7973a767ada)
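A hedged sketch of the kind of check that was missing (gconv-style pointer names assumed):

```
#include <string.h>

/* Before emitting a multi-byte designation escape such as "\x1b$+I",
   verify the output buffer has room; the CVE was exactly this check
   being absent for the SS2/SS3 designations.  */
static int
emit_designation (unsigned char **outptr, unsigned char *outend,
                  const char *esc /* e.g. "\x1b$+I" */)
{
  size_t len = strlen (esc);
  if (*outptr + len > outend)
    return -1;                  /* caller must flush and retry */
  memcpy (*outptr, esc, len);
  *outptr += len;
  return 0;
}
```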
* aarch64: Use memcpy_simd as the default memcpy · Wilco Dijkstra · 2024-04-09 · 6 files · -372/+84

Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy with it.

(cherry picked from commit e6f3fe362f1aab78b1448d69ecdbd9e3872636d3)
* aarch64: correct CFI in rawmemchr (bug 31113) · Andreas Schwab · 2024-04-09 · 1 file · -1/+1

The .cfi_return_column directive changes the return column for the whole FDE range. But the actual intent is to tell the unwinder that the value in x30 (lr) now resides in x15 after the move, and that is expressed by the .cfi_register directive.

(cherry picked from commit 3f798427884fa57770e8e2291cf58d5918254bb5)
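A minimal sketch of the corrected annotation (AArch64-only; GCC top-level asm, DWARF register numbers 30 = x30, 15 = x15):

```
/* After saving the return address into x15, .cfi_register records
   that x30's value now lives in x15 for the following range only,
   whereas .cfi_return_column would retarget the entire FDE.  */
__asm__ (".text\n"
         ".globl  example\n"
         "example:\n"
         ".cfi_startproc\n"
         "mov     x15, x30\n"
         ".cfi_register 30, 15\n"   /* return address now in x15 */
         "mov     x30, x15\n"
         ".cfi_restore 30\n"        /* x30 holds its own value again */
         "ret\n"
         ".cfi_endproc\n");
```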
* AArch64: Improve strrchr · Wilco Dijkstra · 2024-04-09 · 1 file · -25/+33

Use shrn for narrowing the mask which simplifies code and speeds up small strings. Unroll the first search loop to improve performance on large strings.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 55599d480437dcf129b41b95be32b48f2a9e5da9)
* AArch64: Optimize strnlen · Wilco Dijkstra · 2024-04-09 · 1 file · -21/+18

Optimize strnlen using the shrn instruction and improve the main loop. Small strings are around 10% faster, large strings are 40% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ad098893ba3c3344a5f2f6ab1627c47204afdb47)
* AArch64: Optimize strlen · Wilco Dijkstra · 2024-04-09 · 1 file · -8/+12

Optimize strlen by unrolling the main loop. Large strings are 64% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 03c8ce5000198947a4dd7b2c14e5131738fda62b)
* AArch64: Optimize strcpy · Wilco Dijkstra · 2024-04-09 · 1 file · -17/+19

Unroll the main loop. Large strings are around 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 349e48c01e85bd96006860084e76d322e6ca02f1)
* AArch64: Improve strchrnul · Wilco Dijkstra · 2024-04-09 · 1 file · -2/+10

Unroll the main loop, which improves performance slightly.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 09ebd8549b2ce5a3a6c0c7c5f3e62227faf50a99)
* AArch64: Optimize strchr · Wilco Dijkstra · 2024-04-09 · 1 file · -28/+24

Simplify calculation of the mask using shrn. Unroll the main loop. Small strings are 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 51541a229740801882490177fa178e49264b13fb)
* AArch64: Improve strlen_asimd · Wilco Dijkstra · 2024-04-09 · 1 file · -12/+4

Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 1bbb1a2022e126f21810d3d0ebe0a975d5243e43)
* AArch64: Optimize memrchr · Wilco Dijkstra · 2024-04-09 · 1 file · -9/+11

Optimize the main loop - large strings are 43% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 00776241776e67fc666b896c1e85770f4f3ec1e1)
* AArch64: Optimize memchr · Wilco Dijkstra · 2024-04-09 · 1 file · -13/+14

Optimize the main loop - large strings are 40% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ce758d4f063820c2bc743e12797d7454c66be718)
* aarch64: Optimize string functions with shrn instruction · Danila Kutenin · 2024-04-09 · 6 files · -102/+59

We found that string functions were using AND+ADDP to find the nibble/syndrome mask, but there is an easier opportunity through `SHRN dst.8b, src.8h, 4` (shift right every 2 bytes by 4 and narrow to 1 byte), which has the same latency as ADDP on all SIMD ARMv8 targets. There are also possible gains for memcmp, but that's for another patch.

We see 10-20% savings for small-to-mid size cases (<= 128), which are the primary cases for general workloads.

(cherry picked from commit 3c9980698988ef64072f1fac339b180f52792faf)
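The trick as a hedged C intrinsics sketch (AArch64 target required):

```
#include <arm_neon.h>
#include <stdint.h>

/* Compress a 16-byte comparison result into a 64-bit syndrome with a
   single SHRN: shifting each 16-bit lane right by 4 and narrowing to
   8 bits keeps 4 bits per original byte, replacing the AND+ADDP pair.  */
static inline uint64_t
neon_syndrome (uint8x16_t eq /* 0x00 or 0xff per byte */)
{
  uint8x8_t narrowed = vshrn_n_u16 (vreinterpretq_u16_u8 (eq), 4);
  return vget_lane_u64 (vreinterpret_u64_u8 (narrowed), 0);
}
```

A non-zero syndrome locates the first matching byte via __builtin_ctzll (syndrome) >> 2, since each input byte contributes four bits.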
* AArch64: Optimize memcmp · Wilco Dijkstra · 2024-04-09 · 1 file · -107/+134

Rewrite memcmp to improve performance. On small and medium inputs performance is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per iteration, which is 30-50% faster depending on the size.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit b51eb35c572b015641f03e3682c303f7631279b7)
* AArch64: Improve strnlen performance · Wilco Dijkstra · 2024-04-09 · 1 file · -181/+89

Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1 large strings are 1.8x faster than the current version, and bench-strnlen is 50% faster overall. This version is MTE compatible.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 252cad02d4c63540501b9b8c988cb91248563224)
* x86_64: Optimize ffsll function code size. · Sunil K Pandey · 2024-01-31 · 1 file · -5/+5

The ffsll function randomly regresses by ~20%, depending on how the code gets aligned in memory. The ffsll function code size is 17 bytes. Since the default function alignment is 16 bytes, it can load at 16-, 32-, 48- or 64-byte aligned memory. When the ffsll function loads at 16-, 32- or 64-byte aligned memory, the entire code fits in a single 64-byte cache line. When the ffsll function loads at 48-byte aligned memory, it splits across two cache lines, hence the random regression.

Reducing the ffsll function size from 17 bytes to 12 bytes ensures that it will always fit in a single 64-byte cache line.

This patch fixes the ffsll function's random performance regression.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 9d94997b5f9445afd4f2bccc5fa60ff7c4361ec1)
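The alignment arithmetic as a small self-checking C sketch:

```
#include <assert.h>
#include <stdbool.h>

/* A function whose body is SIZE bytes, entered at ENTRY (a multiple
   of the 16-byte function alignment), fits one 64-byte cache line iff
   it does not cross a 64-byte boundary.  */
static bool
fits_one_line (unsigned entry, unsigned size)
{
  return entry % 64 + size <= 64;
}

int
main (void)
{
  assert (fits_one_line (16, 17));   /* ok */
  assert (fits_one_line (32, 17));   /* ok */
  assert (!fits_one_line (48, 17));  /* 48 + 17 = 65: split, slow */
  assert (fits_one_line (48, 12));   /* a 12-byte body always fits */
  return 0;
}
```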
* x86: Fix incorrect scope of setting `shared_per_thread` [BZ# 30745] · Noah Goldstein · 2023-09-11 · 1 file · -4/+3

The:
```
if (shared_per_thread > 0 && threads > 0)
  shared_per_thread /= threads;
```
code was accidentally moved to inside the else scope. This doesn't match how it was previously (before af992e7abd).

This patch fixes that by putting the division after the `else` block.

(cherry picked from commit 084fb31bc2c5f95ae0b9e6df4d3cf0ff43471ede)
* x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold. · Noah Goldstein · 2023-09-11 · 1 file · -1/+10

On some machines we end up with incomplete cache information. This can make the new calculation of `sizeof(total-L3)/custom-divisor` end up lower than intended (and lower than the prior value). So reintroduce the old bound as a lower bound to avoid potentially regressing code where we don't have complete information to make the decision.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8b9a0af8ca012217bf90d1dc0694f85b49ae09da)
* x86: Fix slight bug in `shared_per_thread` cache size calculation. · Noah Goldstein · 2023-09-11 · 1 file · -2/+2

After:
```
commit af992e7abdc9049714da76cae1e5e18bc4838fb8
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 7 13:18:01 2023 -0500

    x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```
split `shared` (cumulative cache size) from `shared_per_thread` (cache size per socket), `shared_per_thread` *can* be slightly off from the previous calculation.

Previously we added `core` even if `threads_l2` was invalid, and only used `threads_l2` to divide `core` if it was present. The changed version only included `core` if `threads_l2` was valid. This change restores the old behavior if `threads_l2` is invalid by adding the entire value of `core`.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 47f747217811db35854ea06741be3685e8bbd44d)
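A hedged sketch of the restored accounting (names follow the message; the surrounding cache-info logic is simplified away):

```
/* Per-thread share of the L2 ('core'): divide by the number of
   threads sharing it when known, otherwise attribute the whole L2,
   as the pre-af992e7abd code did.  */
static unsigned long
add_core_share (unsigned long shared_per_thread,
                unsigned long core, unsigned int threads_l2)
{
  if (threads_l2 > 0)
    shared_per_thread += core / threads_l2;
  else
    shared_per_thread += core;    /* threads_l2 unknown or invalid */
  return shared_per_thread;
}
```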
* x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4` · Noah Goldstein · 2023-09-11 · 1 file · -31/+48

The current `non_temporal_threshold` is set to roughly '3/4 * sizeof_L3 / ncores_per_socket'. This patch updates that value to roughly 'sizeof_L3 / 4'.

The original value (specifically dividing by `ncores_per_socket`) was done to limit the amount of other threads' data a `memcpy`/`memset` could evict. Dividing by 'ncores_per_socket', however, leads to exceedingly low non-temporal thresholds and to using non-temporal stores in cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory, so using them at a size much smaller than L3 can place soon-to-be-accessed data much further away than it otherwise would be. As well, modern machines are able to detect streaming patterns (especially if REP MOVSB is used) and provide LRU hints to the memory subsystem. This in effect caps the total amount of eviction at 1/cache_associativity, far below meaningfully thrashing the entire cache.

As best I can tell, the benchmarks that led to this small threshold were done comparing non-temporal stores versus standard cacheable stores. A better comparison (linked below) is against REP MOVSB which, on the measured systems, is nearly 2x faster than non-temporal stores at the low end of the previous threshold, and within 10% for over 100MB copies (well past even the current threshold). In cases with a low number of threads competing for bandwidth, REP MOVSB is ~2x faster up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs such as Broadwell prefer something closer to `8`. This patch is meant to be followed up by another one to make the divisor cpu-specific, but in the meantime (and for easier backporting), this patch settles on `4` as a middle ground.

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable stores were done using: https://github.com/goldsteinn/memcpy-nt-benchmarks

Sheets results (also available in pdf on the github): https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit af992e7abdc9049714da76cae1e5e18bc4838fb8)
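Taking this entry together with the low-bound entry above, a hedged sketch of the resulting tuning (names illustrative, not the actual dl-cacheinfo code):

```
/* New heuristic: a quarter of the total L3, but never below the old
   per-thread bound reintroduced by the low-bound patch above.  */
static unsigned long
nt_threshold (unsigned long shared /* total L3 bytes */,
              unsigned long shared_per_thread)
{
  unsigned long threshold = shared / 4;
  unsigned long minimum = shared_per_thread * 3 / 4;  /* old value */
  return threshold < minimum ? minimum : threshold;
}
```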
* debug: Mark libSegFault.so as NODELETE · Florian Weimer · 2023-07-21 · 1 file · -0/+2

The signal handler installed in the ELF constructor cannot easily be removed again (because the program may have changed handlers in the meantime). Mark the object as NODELETE so that the registered handler function is never unloaded.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 23ee92deea4c99d0e6a5f48fa7b942909b123ec5)
* Document BZ #20975 fix · H.J. Lu · 2023-05-23 · 1 file · -0/+1
* __check_pf: Add a cancellation cleanup handler [BZ #20975] · H.J. Lu · 2023-05-23 · 2 files · -0/+17

There are reports of hangs in __check_pf:

https://github.com/JoeDog/siege/issues/4

It is reproducible only under specific configurations:

1. Large number of cores (>= 64) and large number of threads (> 3X the number of cores) with long-lived socket connections.
2. Low power (frequency) mode.
3. Power management is enabled.

While holding the lock, __check_pf calls make_request which calls __sendto and __recvmsg. Since __sendto and __recvmsg are cancellation points, the lock held by __check_pf won't be released and can cause a deadlock when thread cancellation happens in __sendto or __recvmsg. Add a cancellation cleanup handler for __check_pf to unlock the lock when cancelled by another thread. This fixes BZ #20975 and the siege hang issue.

(cherry picked from commit a443bd3fb233186038b8b483959ecb7978d1abea)
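The standard shape of such a handler, sketched with the public pthread API (glibc internally uses equivalent macros):

```
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void
cancel_handler (void *arg)
{
  pthread_mutex_unlock (arg);   /* runs if the thread is cancelled */
}

static void
check_pf_like (void)
{
  pthread_mutex_lock (&lock);
  pthread_cleanup_push (cancel_handler, &lock);
  /* ... sendto/recvmsg here are cancellation points; without the
     handler, a cancelled thread would exit still holding the lock ... */
  pthread_cleanup_pop (0);      /* 0: don't run the handler now */
  pthread_mutex_unlock (&lock);
}
```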
* x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591] · Noah Goldstein · 2022-11-24 · 2 files · -28/+43

The previous implementation was adjusting length (rsi) to match bytes (eax), but since there is no bound on length, this can cause overflow. The fix is to just convert the byte count (eax) to a length by dividing by sizeof (wchar_t) before the comparison.

Full check passes on x86-64 and build succeeds w/ and w/o multiarch.

(cherry picked from commit b0969fa53a28b4ab2159806bf6c99a98999502ee)
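Why the direction of the conversion matters, in a short standalone C sketch:

```
#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

int
main (void)
{
  /* Length -> bytes can wrap: the length is caller-controlled and
     unbounded.  Bytes -> length cannot: the byte count is bounded by
     the buffer actually scanned.  */
  size_t maxlen = SIZE_MAX;                      /* "no limit" */
  size_t as_bytes = maxlen * sizeof (wchar_t);   /* wraps: bogus bound */
  size_t bytes_scanned = 64;
  size_t as_chars = bytes_scanned / sizeof (wchar_t);   /* safe: 16 */
  printf ("length->bytes wrapped to %zu; bytes->length = %zu\n",
          as_bytes, as_chars);
  return 0;
}
```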
* x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations · Aurelien Jarno · 2022-10-04 · 2 files · -3/+15

The AVX2 strrchr and wcsrchr implementations use the 'blsmsk' instruction, which belongs to the BMI1 CPU feature, and the 'shrx' instruction, which belongs to the BMI2 CPU feature.

Fixes: df7e295d18ff ("x86: Optimize {str|wcs}rchr-avx2")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 7e8283170c5d6805b609a040801d819e362a6292)
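The shape of the fix across this whole series of entries, as a standalone sketch using GCC's __builtin_cpu_supports (glibc itself uses its internal CPU_FEATURE_USABLE machinery in the ifunc selectors):

```
#include <stdbool.h>

/* Gate the AVX2 variant on every extension it actually executes,
   not just AVX2 itself.  */
static bool
can_use_avx2_strrchr (void)
{
  return __builtin_cpu_supports ("avx2")
         && __builtin_cpu_supports ("bmi")    /* blsmsk */
         && __builtin_cpu_supports ("bmi2");  /* shrx */
}
```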
* x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation · Aurelien Jarno · 2022-10-04 · 2 files · -2/+9

The AVX2 memrchr implementation uses the 'shlxl' instruction, which belongs to the BMI2 CPU feature, and the 'lzcnt' instruction, which belongs to the LZCNT CPU feature.

Fixes: af5306a735eb ("x86: Optimize memrchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3c0c78afabfed4b6fc161c159e628fbf14ff370b)
* x86-64: Require BMI2 for AVX2 (raw|w)memchr implementations · Aurelien Jarno · 2022-10-04 · 1 file · -3/+9

The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi' and 'sarx' instructions, which belong to the BMI2 CPU feature.

Fixes: acfd088a1963 ("x86: Optimize memchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit e3e7fab7fe5186d18ca2046d99ba321c27db30ad)
* x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations · Aurelien Jarno · 2022-10-04 · 1 file · -2/+6

The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit f31a5a884ed84bd37032729d4d1eb9d06c9f3c29)
* x86-64: Require BMI2 for AVX2 strncmp implementation · Aurelien Jarno · 2022-10-04 · 2 files · -4/+7

The AVX2 strncmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit fc7de1d9b99ae1676bc626ddca422d7abee0eb48)
* x86-64: Require BMI2 for AVX2 strcmp implementation · Aurelien Jarno · 2022-10-04 · 2 files · -3/+5

The AVX2 strcmp implementation uses the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 4d64c6445735e9b34e2ac8e369312cbfc2f88e17)
* x86-64: Require BMI2 for AVX2 str(n)casecmp implementations · Aurelien Jarno · 2022-10-04 · 2 files · -8/+21

The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which belongs to the BMI2 CPU feature.

NB: They also use the 'tzcnt' BMI1 instruction, but it is executed as BSF if the CPU doesn't support TZCNT, and produces the same result for non-zero input.

Partially fixes: b77b06e0e296 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 10f79d3670b036925da63dc532b122d27ce65ff8)
* x86: include BMI1 and BMI2 in x86-64-v3 level · Aurelien Jarno · 2022-10-04 · 1 file · -0/+2

The "System V Application Binary Interface AMD64 Architecture Processor Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3 level.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b80f16adbd979831bf25ea491e1261e81885c2b6)
* nptl: Add backoff mechanism to spinlock loop · Wangyang Guo · 2022-09-28 · 4 files · -2/+89

When multiple threads are waiting for a lock at the same time, once the lock owner releases the lock, the waiters all see the lock become available and all try to lock it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced. As try-lock attempts increase, it is more likely that a larger number of threads are competing for the adaptive mutex lock, so the wait time increases exponentially. A random jitter is also added to avoid synchronized try-locks from other threads.

v2: Remove read-check before try-lock for performance.
v3: 1. Restore read-check since it works well on some platforms.
    2. Make backoff arch-dependent, and enable it for x86_64.
    3. Limit max backoff to reduce latency in large critical sections.
v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
v5: Commit log updated for regression in large critical section.

Result of pthread-mutex-locks bench:

Test platform: Xeon 8280L (2 sockets, 112 CPUs in total)
First row: thread number
First column: critical-section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
      1     2     4     8     16    32    64    112   140
0     0.99  0.58  0.52  0.49  0.43  0.44  0.46  0.52  0.54
1     0.98  0.43  0.56  0.50  0.44  0.45  0.50  0.56  0.57
2     0.99  0.41  0.57  0.51  0.45  0.47  0.48  0.60  0.61
4     0.99  0.45  0.59  0.53  0.48  0.49  0.52  0.64  0.65
8     1.00  0.66  0.71  0.63  0.56  0.59  0.66  0.72  0.71
16    0.97  0.78  0.91  0.73  0.67  0.70  0.79  0.80  0.80
32    0.95  1.17  0.98  0.87  0.82  0.86  0.89  0.90  0.90
64    0.96  0.95  1.01  1.01  0.98  1.00  1.03  0.99  0.99
128   0.99  1.01  1.01  1.17  1.08  1.12  1.02  0.97  1.02

non-critical-length: 32
      1     2     4     8     16    32    64    112   140
0     1.03  0.97  0.75  0.65  0.58  0.58  0.56  0.70  0.70
1     0.94  0.95  0.76  0.65  0.58  0.58  0.61  0.71  0.72
2     0.97  0.96  0.77  0.66  0.58  0.59  0.62  0.74  0.74
4     0.99  0.96  0.78  0.66  0.60  0.61  0.66  0.76  0.77
8     0.99  0.99  0.84  0.70  0.64  0.66  0.71  0.80  0.80
16    0.98  0.97  0.95  0.76  0.70  0.73  0.81  0.85  0.84
32    1.04  1.12  1.04  0.89  0.82  0.86  0.93  0.91  0.91
64    0.99  1.15  1.07  1.00  0.99  1.01  1.05  0.99  0.99
128   1.00  1.21  1.20  1.22  1.25  1.31  1.12  1.10  0.99

non-critical-length: 128
      1     2     4     8     16    32    64    112   140
0     1.02  1.00  0.99  0.67  0.61  0.61  0.61  0.74  0.73
1     0.95  0.99  1.00  0.68  0.61  0.60  0.60  0.74  0.74
2     1.00  1.04  1.00  0.68  0.59  0.61  0.65  0.76  0.76
4     1.00  0.96  0.98  0.70  0.63  0.63  0.67  0.78  0.77
8     1.01  1.02  0.89  0.73  0.65  0.67  0.71  0.81  0.80
16    0.99  0.96  0.96  0.79  0.71  0.73  0.80  0.84  0.84
32    0.99  0.95  1.05  0.89  0.84  0.85  0.94  0.92  0.91
64    1.00  0.99  1.16  1.04  1.00  1.02  1.06  0.99  0.99
128   1.00  1.06  0.98  1.14  1.39  1.26  1.08  1.02  0.98

There is regression in large critical sections. But adaptive mutexes are aimed at "quick" locks; small critical sections are more common when users choose to use adaptive pthread_mutex.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8162147872491bb5b48e91543b19c49a29ae6b6d)
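A minimal C sketch of the mechanism described above (x86-only as written; glibc's actual implementation lives in sysdeps/nptl/pthread_mutex_backoff.h and its x86 override):

```
#include <stdint.h>
#include <x86intrin.h>

/* Spin 2^exp iterations plus up to 2^exp of jitter so contending
   threads don't retry in lockstep; cap the exponent to bound
   worst-case latency.  */
static inline void
spin_backoff (unsigned int *exp, unsigned int max_exp)
{
  uint32_t jitter = (uint32_t) __rdtsc ();      /* cheap randomness */
  unsigned int spins = (1u << *exp) + (jitter & ((1u << *exp) - 1));
  while (spins-- > 0)
    __builtin_ia32_pause ();                    /* x86 'pause' hint */
  if (*exp < max_exp)
    ++*exp;
}
```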
* sysdeps: Add 'get_fast_jitter' interface in fast-jitter.h · Noah Goldstein · 2022-09-28 · 1 file · -0/+42

'get_fast_jitter' is meant to be used purely for performance purposes. In all cases it's used, it should be acceptable to get no randomness (see default case). An example use case is in setting jitter for retries between threads at a lock. There is a performance benefit to having jitter, but only if the jitter can be generated very quickly, and ultimately there is no serious issue if no jitter is generated.

The implementation generally uses 'HP_TIMING_NOW' iff it is inlined (to avoid any potential syscall paths).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 911c63a51c690dd1a97dfc587097277029baf00f)
* nptl: Effectively skip CAS in spinlock loop · Jangwoong Kim · 2022-09-28 · 1 file · -3/+2

The commit:
"Add LLL_MUTEX_READ_LOCK [BZ #28537]"
SHA1: d672a98a1af106bd68deb15576710cd61363f7a6
introduced LLL_MUTEX_READ_LOCK to skip the CAS in the spinlock loop if the atomic load fails. But "continue" inside a do-while loop does not skip the evaluation of the controlling expression, so the CAS was not skipped.

Replace the do-while with a while loop and skip LLL_MUTEX_TRYLOCK if LLL_MUTEX_READ_LOCK fails.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6b8dbbd03ac88f169b65b5c7d7278576a11d2e44)
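The C pitfall in isolation, as a hedged sketch (stubbed lock word, not the nptl code):

```
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lockword;

static bool
try_cas (void)
{
  int expected = 0;
  return atomic_compare_exchange_weak (&lockword, &expected, 1);
}

static void
spin_buggy (void)
{
  do
    {
      if (atomic_load_explicit (&lockword, memory_order_relaxed) != 0)
        continue;   /* jumps to the controlling expression:
                       the CAS below still runs every iteration */
    }
  while (!try_cas ());
}

static void
spin_fixed (void)
{
  while (true)
    {
      /* Attempt the CAS only when the relaxed read saw the lock free.  */
      if (atomic_load_explicit (&lockword, memory_order_relaxed) == 0
          && try_cas ())
        break;
    }
}
```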
* Move assignment out of the CAS condition · H.J. Lu · 2022-09-28 · 2 files · -8/+6

Update
```
commit 49302b8fdf9103b6fc0a398678668a22fa19574c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:54:01 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.
```
and
```
commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:31:51 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.
```
by moving the assignment out of the CAS condition.

(cherry picked from commit 120ac6d238825452e8024e2f627da33b2508dfd3)
* Add LLL_MUTEX_READ_LOCK [BZ #28537] · H.J. Lu · 2022-09-28 · 1 file · -0/+7

The CAS instruction is expensive. From the x86 CPU's point of view, getting a cache line for writing is more expensive than reading. See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap will grab the cache line exclusive and cause excessive cache line bouncing.

Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the spinlock loop if the compare may fail, to reduce cache line bouncing on contended locks.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d672a98a1af106bd68deb15576710cd61363f7a6)
* Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537] · H.J. Lu · 2022-09-28 · 1 file · -5/+5

Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 49302b8fdf9103b6fc0a398678668a22fa19574c)
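What "boolean CAS" versus "value CAS" means here, sketched with C11 atomics (glibc's own macros are atomic_compare_and_exchange_bool_acq / atomic_compare_and_exchange_val_acq):

```
#include <stdatomic.h>

static atomic_int lockword;

/* A boolean CAS reports only success or failure, so after a failure
   the loop needs an extra atomic load to observe the current value.
   A value CAS returns the value it witnessed; C11's compare_exchange
   gets the same effect by writing that value back into 'expected',
   so the separate load disappears.  */
static void
spin_lock_value_cas (void)
{
  int expected = 0;
  while (!atomic_compare_exchange_weak (&lockword, &expected, 1))
    expected = 0;  /* failed CAS already delivered the observed value */
}
```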
* Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537] · H.J. Lu · 2022-09-28 · 1 file · -5/+5

Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f)
* elf: Call __libc_early_init for reused namespaces (bug 29528) · Florian Weimer · 2022-08-30 · 6 files · -6/+137

libc_map is never reset to NULL, neither during dlclose nor on a dlopen call which reuses the namespace structure. As a result, if a namespace is reused, its libc is not initialized properly. The most visible result is a crash in the <ctype.h> functions.

To prevent similar bugs on namespace reuse from surfacing, unconditionally initialize the chosen namespace to zero using memset.

(cherry picked from commit d0e357ff45a75553dee3b17ed7d303bfa544f6fe)
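How a namespace gets reused, sketched with the public dlmopen/dlinfo API (library names hypothetical):

```
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
  /* Create a fresh namespace, note its id, then empty it again.  */
  void *h1 = dlmopen (LM_ID_NEWLM, "libfoo.so", RTLD_NOW);
  if (h1 == NULL)
    return 1;
  Lmid_t ns;
  dlinfo (h1, RTLD_DI_LMID, &ns);
  dlclose (h1);

  /* Loading into the same id reuses the namespace structure; before
     the fix, the fresh libc in the reused namespace missed its
     __libc_early_init call.  */
  void *h2 = dlmopen (ns, "libbar.so", RTLD_NOW);
  printf ("namespace reuse %s\n", h2 != NULL ? "succeeded" : "failed");
  return 0;
}
```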
* x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S · Noah Goldstein · 2022-07-18 · 1 file · -3/+5

The check was missing, so for the multiarch build rtld-strncmp-sse4_2.os was being built and exporting symbols:

build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42

Introduced in:
```
commit 11ffcacb64a939c10cfc713746b8ec88837f5c4a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jun 21 12:10:50 2017 -0700

    x86-64: Implement strcmp family IFUNC selectors in C
```

(cherry picked from commit 96ac447d915ea5ecef3f9168cc13f4e731349a3b)
* x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file · Noah Goldstein · 2022-07-18 · 3 files · -50/+73

The primary memmove_{impl}_unaligned_erms implementations don't interact with this function. Putting them in the same file both wastes space and unnecessarily bloats a hot code section.

(cherry picked from commit 21925f64730d52eb7d8b2fb62b412f8ab92b0caf)
* x86: Move and slightly improve memset_erms · Noah Goldstein · 2022-07-18 · 3 files · -31/+45

Implementation-wise:
1. Remove the VZEROUPPER, as memset_{impl}_unaligned_erms does not use the L(stosb) label that was previously defined.
2. Don't give the hot path (fallthrough) to zero size.

Code-positioning-wise:
Move memset_{chk}_erms to its own file. Leaving it in between the memset_{impl}_unaligned variants both adds unnecessary complexity to the file and wastes space in a relatively hot cache section.

(cherry picked from commit 4a3f29e7e475dd4e7cce2a24c187e6fb7b5b0a05)