about summary refs log tree commit diff
path: root/sysdeps/powerpc/powerpc64/le
Commit message (Collapse)AuthorAgeFilesLines
* powerpc: Placeholder and infrastructure/build support to add Power11 ↵Amrita H S2024-03-195-2/+9
| | | | | | | | | | | | related changes. The following three changes have been added to provide initial Power11 support. 1. Add the directories to hold Power11 files. 2. Add support to select Power11 libraries based on AT_PLATFORM. 3. Let submachine=power11 be set automatically. Reviewed-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
* Rename c2x / gnu2x tests to c23 / gnu23Joseph Myers2024-02-011-2/+2
| | | | | | | Complete the internal renaming from "C2X" and related names in GCC by renaming *-c2x and *-gnu2x tests to *-c23 and *-gnu23. Tested for x86_64, and with build-many-glibcs.py for powerpc64le.
* Update copyright dates with scripts/update-copyrightsPaul Eggert2024-01-0130-30/+30
|
* powerpc: Fix performance issues of strcmp power10Amrita H S2023-12-151-66/+95
| | | | | | | | | | | | | | | | | Current implementation of strcmp for power10 has performance regression for multiple small sizes and alignment combination. Most of these performance issues are fixed by this patch. The compare loop is unrolled and page crosses of unrolled loop is handled. Thanks to Paul E. Murphy for helping in fixing the performance issues. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com> Co-Authored-By: Paul E. Murphy <murphyp@linux.ibm.com> Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* powerpc : Add optimized memchr for POWER10MAHESH BODAPATI2023-12-141-0/+315
| | | | | | Optimized memchr for POWER10 based on existing rawmemchr and strlen. Reordering instructions and loop unrolling helped in getting better performance. Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* powerpc: Optimized strcmp for power10Amrita H S2023-12-071-0/+204
| | | | | | | | | | | | | | | | | | | This patch is based on __strcmp_power9 and __strlen_power10. Improvements from __strcmp_power9: 1. Uses new POWER10 instructions - This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. 2. Performance implication - This version has around 30% better performance on average. - Performance regression is seen for a specific combination of sizes and alignments. Some of them is observed without changes also, while rest may be induced by the patch. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com> Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
* configure: Use autoconf 2.71Siddhesh Poyarekar2023-07-172-72/+94
| | | | | | | | | | | | | | Bump autoconf requirement to 2.71 to allow regenerating configure on more recent distributions. autoconf 2.71 has been in Fedora since F36 and is the current version in Debian stable (bookworm). It appears to be current in Gentoo as well. All sysdeps configure and preconfigure scripts have also been regenerated; all changes are trivial transformations that do not affect functionality. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Fix misspellings in sysdeps/powerpc -- BZ 25337Paul Pluzhnikov2023-05-233-3/+3
| | | | | | | All fixes are in comments, so the binaries should be identical before/after this commit, but I can't verify this. Reviewed-by: Rajalakshmi Srinivasaraghavan <rajis@linux.ibm.com>
* powerpc:GCC(<10) doesn't allow -mlong-double-64 after -mabi=ieeelongdoubleMahesh Bodapati2023-05-191-0/+17
| | | | | Removed -mabi=ieeelongdouble on failing tests. It resolves the error. error: ‘-mabi=ieeelongdouble’ requires ‘-mlong-double-128’
* Added Redirects to longdouble error functions [BZ #29033]Sachin Monga2023-05-101-0/+1
| | | | | | | | This patch redirects the error functions to the appropriate longdouble variants which enables the compiler to optimize for the abi ieeelongdouble. Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
* Update copyright dates with scripts/update-copyrightsJoseph Myers2023-01-0628-28/+28
|
* Disable use of -fsignaling-nans if compiler does not support itAdhemerval Zanella2022-11-011-2/+2
| | | | Reviewed-by: Fangrui Song <maskray@google.com>
* Fix build with GCC 13 _FloatN, _FloatNx built-in functionsJoseph Myers2022-10-311-0/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC 13 has added more _FloatN and _FloatNx versions of existing <math.h> and <complex.h> built-in functions, for use in libstdc++-v3. This breaks the glibc build because of how those functions are defined as aliases to functions with the same ABI but different types. Add appropriate -fno-builtin-* options for compiling relevant files, as already done for the case of long double functions aliasing double ones and based on the list of files used there. I fixed some mistakes in that list of double files that I noticed while implementing this fix, but there may well be more such (harmless) cases, in this list or the new one (files that don't actually exist or don't define the named functions as aliases so don't need the options). I did try to exclude cases where glibc doesn't define certain functions for _FloatN or _FloatNx types at all from the new uses of -fno-builtin-* options. As with the options for double files (see the commit message for commit 49348beafe9ba150c9bd48595b3f372299bddbb0, "Fix build with GCC 10 when long double = double."), it's deliberate that the options are used even if GCC currently doesn't have a built-in version of a given functions, so providing some level of future-proofing against more such built-in functions being added in future. Tested with build-many-glibcs.py for aarch64-linux-gnu powerpc-linux-gnu powerpc64le-linux-gnu x86_64-linux-gnu (compilers and glibcs builds) with GCC mainline.
* elf: Remove -fno-tree-loop-distribute-patterns usage on dl-supportAdhemerval Zanella2022-10-101-0/+24
| | | | | | | | | | | | | | Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can b used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197]Matheus Castanho2022-06-071-2/+2
| | | | | | | | | | | | | | | __strncpy_power9 initializes VR 18 with zeroes to be used throughout the code, including when zero-padding the destination string. However, the v18 reference was mistakenly being used for stxv and stxvl, which take a VSX vector as operand. The code ended up using the uninitialized VSR 18 register by mistake. Both occurrences have been changed to use the proper VSX number for VR 18 (i.e. VSR 50). Tested on powerpc, powerpc64 and powerpc64le. Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
* configure.ac: fix bashisms in configure.acSam James2022-03-224-4/+4
| | | | | | | | | | | | | | | | | | | configure scripts need to be runnable with a POSIX-compliant /bin/sh. On many (but not all!) systems, /bin/sh is provided by Bash, so errors like this aren't spotted. Notably Debian defaults to /bin/sh provided by dash which doesn't tolerate such bashisms as '=='. This retains compatibility with bash. Fixes configure warnings/errors like: ``` checking if compiler warns about alias for function with incompatible types... yes /var/tmp/portage/sys-libs/glibc-2.34-r10/work/glibc-2.34/configure: 4209: test: xyes: unexpected operator ``` Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Signed-off-by: Sam James <sam@gentoo.org>
* powerpc: Remove powerpc64 bzero optimizationsAdhemerval Zanella2022-02-231-12/+0
| | | | | The symbol is not present in current POSIX specification and compiler already generates memset call.
* powerpc: Remove bcopy optimizationsAdhemerval Zanella2022-02-231-13/+0
| | | | | The symbols is not present in current POSIX specification and compiler already generates memmove call.
* powerpc64le: Use <gcc-macros.h> in early HWCAP checkFlorian Weimer2022-01-141-4/+5
| | | | | | | | This is required so that the checks still work if $(early-cflags) selects a different ISA level. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
* Update copyright dates with scripts/update-copyrightsPaul Eggert2022-01-0127-27/+27
| | | | | | | | | | | | | | | | | | | | | | | I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not. remote: *** 912-#endif remote: *** 913: remote: *** 914- remote: *** error: lines with trailing whitespace found ... remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
* String: Add hidden defs for __memcmpeq() to enable internal usageNoah Goldstein2021-10-261-0/+1
| | | | | | | | No bug. This commit adds hidden defs for all declarations of __memcmpeq. This enables usage of __memcmpeq without the PLT for usage internal to GLIBC.
* String: Add support for __memcmpeq() ABI on all targetsNoah Goldstein2021-10-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | No bug. This commit adds support for __memcmpeq() as a new ABI for all targets. In this commit __memcmpeq() is implemented only as an alias to the corresponding targets memcmp() implementation. __memcmpeq() is added as a new symbol starting with GLIBC_2.35 and defined in string.h with comments explaining its behavior. Basic tests that it is callable and works where added in string/tester.c As discussed in the proposal "Add new ABI '__memcmpeq()' to libc" __memcmpeq() is essentially a reserved namespace for bcmp(). The means is shares the same specifications as memcmp() except the return value for non-equal byte sequences is any non-zero value. This is less strict than memcmp()'s return value specification and can be better optimized when a boolean return is all that is needed. __memcmpeq() is meant to only be called by compilers if they can prove that the return value of a memcmp() call is only used for its boolean value. All tests in string/tester.c passed. As well build succeeds on x86_64-linux-gnu target.
* Add narrowing fma functionsJoseph Myers2021-09-222-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the narrowing fused multiply-add functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: ffma, ffmal, dfmal, f32fmaf64, f32fmaf32x, f32xfmaf64 for all configurations; f32fmaf64x, f32fmaf128, f64fmaf64x, f64fmaf128, f32xfmaf64x, f32xfmaf128, f64xfmaf128 for configurations with _Float64x and _Float128; __f32fmaieee128 and __f64fmaieee128 aliases in the powerpc64le case (for calls to ffmal and dfmal when long double is IEEE binary128). Corresponding tgmath.h macro support is also added. The changes are mostly similar to those for the other narrowing functions previously added, especially that for sqrt, so the description of those generally applies to this patch as well. As with sqrt, I reused the same test inputs in auto-libm-test-in as for non-narrowing fma rather than adding extra or separate inputs for narrowing fma. The tests in libm-test-narrow-fma.inc also follow those for non-narrowing fma. The non-narrowing fma has a known bug (bug 6801) that it does not set errno on errors (overflow, underflow, Inf * 0, Inf - Inf). Rather than fixing this or having narrowing fma check for errors when non-narrowing does not (complicating the cases when narrowing fma can otherwise be an alias for a non-narrowing function), this patch does not attempt to check for errors from narrowing fma and set errno; the CHECK_NARROW_FMA macro is still present, but as a placeholder that does nothing, and this missing errno setting is considered to be covered by the existing bug rather than needing a separate open bug. missing-errno annotations are duly added to many of the auto-libm-test-in test inputs for fma. This completes adding all the new functions from TS 18661-1 to glibc, so will be followed by corresponding stdc-predef.h changes to define __STDC_IEC_60559_BFP__ and __STDC_IEC_60559_COMPLEX__, as the support for TS 18661-1 will be at a similar level to that for C standard floating-point facilities up to C11 (pragmas not implemented, but library functions done). (There are still further changes to be done to implement changes to the types of fromfp functions from N2548.) Tested as followed: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* Add narrowing square root functionsJoseph Myers2021-09-102-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the narrowing square root functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: fsqrt, fsqrtl, dsqrtl, f32sqrtf64, f32sqrtf32x, f32xsqrtf64 for all configurations; f32sqrtf64x, f32sqrtf128, f64sqrtf64x, f64sqrtf128, f32xsqrtf64x, f32xsqrtf128, f64xsqrtf128 for configurations with _Float64x and _Float128; __f32sqrtieee128 and __f64sqrtieee128 aliases in the powerpc64le case (for calls to fsqrtl and dsqrtl when long double is IEEE binary128). Corresponding tgmath.h macro support is also added. The changes are mostly similar to those for the other narrowing functions previously added, so the description of those generally applies to this patch as well. However, the not-actually-narrowing cases (where the two types involved in the function have the same floating-point format) are aliased to sqrt, sqrtl or sqrtf128 rather than needing a separately built not-actually-narrowing function such as was needed for add / sub / mul / div. Thus, there is no __nldbl_dsqrtl name for ldbl-opt because no such name was needed (whereas the other functions needed such a name since the only other name for that entry point was e.g. f32xaddf64, not reserved by TS 18661-1); the headers are made to arrange for sqrt to be called in that case instead. The DIAG_* calls in sysdeps/ieee754/soft-fp/s_dsqrtl.c are because they were observed to be needed in GCC 7 testing of riscv32-linux-gnu-rv32imac-ilp32. The other sysdeps/ieee754/soft-fp/ files added didn't need such DIAG_* in any configuration I tested with build-many-glibcs.py, but if they do turn out to be needed in more files with some other configuration / GCC version, they can always be added there. I reused the same test inputs in auto-libm-test-in as for non-narrowing sqrt rather than adding extra or separate inputs for narrowing sqrt. The tests in libm-test-narrow-sqrt.inc also follow those for non-narrowing sqrt. Tested as followed: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* powerpc64le: Fix typo in configureAnton Blanchard2021-07-082-2/+2
| | | | | | The configure script checks for -mlong-double-128 but mentions -mlongdouble when it fails. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: optimize strcpy/stpcpy for POWER9/10Pedro Franco de Carvalho2021-07-011-71/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch modifies the current POWER9 implementation of strcpy and stpcpy to optimize it for POWER9/10. Since no new POWER10 instructions are used, the original POWER9 strcpy is modified instead of creating a new implementation for POWER10. This implementation is based on both the original POWER9 implementation of strcpy and the preamble of the new POWER10 implementation of strlen. The changes also affect stpcpy, which uses the same implementation with some additional code before returning. On POWER9, averaging improvements across the benchmark inputs (length/source alignment/destination alignment), for an experiment that ran the benchmark five times, bench-strcpy showed an improvement of 5.23%, and bench-stpcpy showed an improvement of 6.59%. On POWER10, bench-strcpy showed 13.16%, and bench-stpcpy showed 13.59%. The changes are: 1. Removed the null string optimization. Although this results in a few extra cycles for the null string, in combination with the second change, this resulted in improvements for for other cases. 2. Adapted the preamble from strlen for POWER10. This is the part of the function that handles up to the first 16 bytes of the string. 3. Increased number of unrolled iterations in the main loop to 6. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Tested-by: Matheus Castanho <msc@linux.ibm.com>
* powerpc: Optimized memcmp for power10Lucas A. M. Magalhaes2021-05-311-0/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch was based on the __memcmp_power8 and the recent __strlen_power10. Improvements from __memcmp_power8: 1. Don't need alignment code. On POWER10 lxvp and lxvl do not generate alignment interrupts, so they are safe for use on caching-inhibited memory. Notice that the comparison on the main loop will wait for both VSR to be ready. Therefore aligning one of the input address does not improve performance. In order to align both registers a vperm is necessary which add too much overhead. 2. Uses new POWER10 instructions This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. The vextractbm is used to have a smaller tail code for calculating the return value. 3. Performance improvement This version has around 35% better performance on average. I saw no performance regressions for any length or alignment. Thanks Matheus for helping me out with some details. Co-authored-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* powerpc64le: Check HWCAP bits against compiler build flagsFlorian Weimer2021-05-191-0/+52
| | | | | | | When built with GCC 11.1 and -mcpu=power9, ld.so prints this error message when running on POWER8: Fatal glibc error: CPU lacks ISA 3.00 support (POWER9 or later required)
* powerpc: Add optimized rawmemchr for POWER10Matheus Castanho2021-05-172-25/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reuse code for optimized strlen to implement a faster version of rawmemchr. This takes advantage of the same benefits provided by the strlen implementation, but needs some extra steps. __strlen_power10 code should be unchanged after this change. rawmemchr returns a pointer to the char found, while strlen returns only the length, so we have to take that into account when preparing the return value. To quickly check 64B, the loop on __strlen_power10 merges the whole block into 16B by using unsigned minimum vector operations (vminub) and checks if there are any \0 on the resulting vector. The same code is used by rawmemchr if the char c is 0. However, this approach does not work when c != 0. We first need to subtract each byte by c, so that the value we are looking for is converted to a 0, then taking the minimum and checking for nulls works again. The new code branches after it has compared ~256 bytes and chooses which of the two strategies above will be used in the main loop, based on the char c. This extra branch adds some overhead (~5%) for length ~256, but is quickly amortized by the faster loop for larger sizes. Compared to __rawmemchr_power9, this version is ~20% faster for length < 256. Because of the optimized main loop, the improvement becomes ~35% for c != 0 and ~50% for c = 0 for strings longer than 256. Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* powerpc64le: Optimize memset for POWER10Raoni Fassina Firmino2021-04-301-0/+256
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implementation is based on __memset_power8 and integrates a lot of suggestions from Anton Blanchard. The biggest difference is that it makes extensive use of stxvl to alignment and tail code to avoid branches and small stores. It has three main execution paths: a) "Short lengths" for lengths up to 64 bytes, avoiding as many branches as possible. b) "General case" for larger lengths, it has an alignment section using stxvl to avoid branches, a 128 bytes loop and then a tail code, again using stxvl with few branches. c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set value being zero. It is mostly the __memset_power8 code but the alignment phase was simplified because, at this point, address is already 16-bytes aligned and also changed to use vector stores. The tail code was also simplified to reuse the general case tail. All unaligned stores use stxvl instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. On average, this implementation provides something around 30% improvement when compared to __memset_power8. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64le: Optimize memcpy for POWER10Tulio Magno Quites Machado Filho2021-04-301-0/+198
| | | | | | | | | | | | | | | | | | This implementation is based on __memcpy_power8_cached and integrates suggestions from Anton Blanchard. It benefits from loads and stores with length for short lengths and for tail code, simplifying the code. All unaligned memory accesses use instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. The main loop has also been modified in order to increase instruction throughput by reducing the dependency on updates from previous iterations. On average, this implementation provides around 30% improvement when compared to __memcpy_power7 and 10% improvement in comparison to __memcpy_power8_cached.
* powerpc64le: Optimized memmove for POWER10Lucas A. M. Magalhaes2021-04-301-0/+320
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch was initially based on the __memmove_power7 with some ideas from strncpy implementation for Power 9. Improvements from __memmove_power7: 1. Use lxvl/stxvl for alignment code. The code for Power 7 uses branches when the input is not naturally aligned to the width of a vector. The new implementation uses lxvl/stxvl instead which reduces pressure on GPRs. It also allows the removal of branch instructions, implicitly removing branch stalls and mispredictions. 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited memory. On Power 10 vector load and stores are safe to use on CI memory for addresses unaligned to 16B. This code takes advantage of this to do unaligned loads. The unaligned loads don't have a significant performance impact by themselves. However doing so decreases register pressure on GPRs and interdependence stalls on load/store pairs. This also improved readability as there are now less code paths for different alignments. Finally this reduces the overall code size. 3. Improved performance. This version runs on average about 30% better than memmove_power7 for lengths larger than 8KB. For input lengths shorter than 8KB the improvement is smaller, it has on average about 17% better performance. This version has a degradation of about 50% for input lengths in the 0 to 31 bytes range when dest is unaligned. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add log IFUNC multiarch support for POWER10Raphael Moreira Zinsly2021-04-267-0/+101
| | | | | | | Checked on ppc64le built without --with-cpu, with --with-cpu=power9 and with --disable-multi-arch. Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
* powerpc: Add optimized strlen for POWER10Matheus Castanho2021-04-221-0/+221
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improvements compared to POWER9 version: 1. Take into account first 16B comparison for aligned strings The previous version compares the first 16B and increments r4 by the number of bytes until the address is 16B-aligned, then starts doing aligned loads at that address. For aligned strings, this causes the first 16B to be compared twice, because the increment is 0. Here we calculate the next 16B-aligned address differently, which avoids that issue. 2. Use simple comparisons for the first ~192 bytes The main loop is good for big strings, but comparing 16B each time is better for smaller strings. So after aligning the address to 16 Bytes, we check more 176B in 16B chunks. There may be some overlaps with the main loop for unaligned strings, but we avoid using the more aggressive strategy too soon, and also allow the loop to start at a 64B-aligned address. This greatly benefits smaller strings and avoids overlapping checks if the string is already aligned at a 64B boundary. 3. Reduce dependencies between load blocks caused by address calculation on loop Doing a precise time tracing on the code showed many loads in the loop were stalled waiting for updates to r4 from previous code blocks. This implementation avoids that as much as possible by using 2 registers (r4 and r5) to hold addresses to be used by different parts of the code. Also, the previous code aligned the address to 16B, then to 64B by doing a few 48B loops (if needed) until the address was aligned. The main loop could not start until that 48B loop had finished and r4 was updated with the current address. Here we calculate the address used by the loop very early, so it can start sooner. The main loop now uses 2 pointers 128B apart to make pointer updates less frequent, and also unrolls 1 iteration to guarantee there is enough time between iterations to update the pointers, reducing stalled cycles. 4. Use new P10 instructions lxvp is used to load 32B with a single instruction, reducing contention in the load queue. vextractbm allows simplifying the tail code for the loop, replacing vbpermq and avoiding having to generate a permute control vector. Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com> Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
* powerpc64le: Use ifunc for _Float128 functions also in libcAndreas Schwab2021-04-013-8/+17
| | | | | | This fixes missing definition of math functions in libc in a static link that are no longer built for libm after commit 4898d9712b ("Avoid adding duplicated symbols into static libraries").
* powerpc: Add optimized llogb* for POWER9Raphael Moreira Zinsly2021-03-162-0/+43
| | | | | The POWER9 builtins used to improve the ilogb* functions can be used in the llogb* functions as well.
* powerpc: Add optimized ilogb* for POWER9Raphael Moreira Zinsly2021-03-162-0/+34
| | | | | | The instructions xsxexpdp and xsxexpqp introduced on POWER9 extract the exponent from a double-precision and quad-precision floating-point respectively, thus they can be used to improve ilogb, ilogbf and ilogbf128.
* Update copyright dates with scripts/update-copyrightsPaul Eggert2021-01-0219-19/+19
| | | | | | | | | | | | | | | | I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO. I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah: remote: *** pre-commit check failed ... remote: *** error: lines with trailing whitespace found remote: error: hook declined to update refs/heads/master
* powerpc64le: Add glibc-hwcaps supportFlorian Weimer2020-12-043-0/+128
| | | | | The "power10" and "power9" subdirectories are selected in a way that matches the -mcpu=power10 and -mcpu=power9 options of GCC.
* powerpc64le: ifunc select *f128 routines in multiarch modePaul E. Murphy2020-11-3016-197/+817
| | | | | | | | | | | | | | | | | | | | | | | | | Programatically generate simple wrappers for interesting libm *f128 objects. Selected functions are transcendental functions or those with trivial compiler builtins. This can result in a 2-3x speedup (e.g logf128 and expf128). A second set of implementation files are generated which include the first implementation encountered along the search path. This usually works, except when a wrapper is overriden and makefile search order slightly diverges from include order. Likewise, wrapper object files are created for each generated file. These hold the ifunc selection routines which export ABI. Next, several shared headers are intercepted to control renaming of asm function redirects are used first, and sometimes macro renames if the former is impractical. Notably, if the request machine supports hardware IEEE128 (i.e POWER9 and newer) this ifunc machinery is disabled. Likewise existing ifunc support for float128 is consolidated into this (e.g sqrtf128 and fmaf128). Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add optimized stpncpy for POWER9Raphael M Zinsly2020-11-122-1/+91
| | | | | | | Add stpncpy support into the POWER9 strncpy. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add optimized strncpy for POWER9Raphael M Zinsly2020-11-121-0/+344
| | | | | | | | Similar to the strcpy P9 optimization, this version uses VSX to improve performance. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64le: guarantee a .gnu.attributes section [BZ #26220]Paul E. Murphy2020-07-211-0/+8
| | | | | | | | | | Upstream GCC 11 development is now building the ibm128 runtime support (in libgcc) without a .gnu.attributes section on ppc64le. Ensure we have one to replace by building one ibm128 file in libc and libm with attributes. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add support for POWER10Tulio Magno Quites Machado Filho2020-06-294-0/+5
| | | | | | | | 1. Add the directories to hold POWER10 files. 2. Add support to select POWER10 libraries based on AT_PLATFORM. 3. Let submachine=power10 be set automatically.
* powerpc64le: refactor e_sqrtf128.cPaul E. Murphy2020-06-162-39/+7
| | | | | Combine both implementations into a single file to allow building twice with appropriate multiarch support when possible.
* powerpc64le: add optimized strlen for P9Paul E. Murphy2020-06-052-0/+214
| | | | | | | | | | | | | | | | This started as a trivial change to Anton's rawmemchr. I got carried away. This is a hybrid between P8's asympotically faster 64B checks with extremely efficient small string checks e.g <64B (and sometimes a little bit more depending on alignment). The second trick is to align to 64B by running a 48B checking loop 16B at a time until we naturally align to 64B (i.e checking 48/96/144 bytes/iteration based on the alignment after the first 5 comparisons). This allieviates the need to check page boundaries. Finally, explicly use the P7 strlen with the runtime loader when building P9. We need to be cautious about vector/vsx extensions here on P9 only builds.
* powerpc64le: use common fmaf128 implementationPaul E. Murphy2020-06-052-37/+3
| | | | | | | | | This defines the macro such that it should behave best on all supported powerpc targets. Likewise, this allows us to remove the ppc64le specific s_fmaf128.c. I have verified powerpc64le multiarch and powerpc64le power9 no-multiarch builds continue to generate optimize fmaf128.
* powerpc: Optimized rawmemchr for POWER9Anton Blanchard2020-05-181-0/+107
| | | | | | | | | | | This version uses vector instructions and is up to 60% faster on medium matches and up to 90% faster on long matches, compared to the POWER7 version. A few examples: __rawmemchr_power9 __rawmemchr_power7 Length 32, alignment 0: 2.27566 3.77765 Length 64, alignment 2: 2.46231 3.51064 Length 1024, alignment 0: 17.3059 32.6678
* powerpc: Optimized stpcpy for POWER9Anton Blanchard via Libc-alpha2020-05-182-15/+82
| | | | | | | | | | Add stpcpy support to the POWER9 strcpy. This is up to 40% faster on small strings and up to 90% faster on long relatively unaligned strings, compared to the POWER8 version. A few examples: __stpcpy_power9 __stpcpy_power8 Length 20, alignments in bytes 4/ 4: 2.58246 4.8788 Length 1024, alignments in bytes 1/ 6: 24.8186 47.8528
* powerpc: Optimized strcpy for POWER9Anton Blanchard via Libc-alpha2020-05-181-0/+144
| | | | | | | | | | This version uses VSX store vector with length instructions and is significantly faster on small strings and relatively unaligned large strings, compared to the POWER8 version. A few examples: __strcpy_power9 __strcpy_power8 Length 16, alignments in bytes 0/ 0: 2.52454 4.62695 Length 412, alignments in bytes 4/ 0: 11.6 22.9185