path: root/sysdeps/powerpc/powerpc64
Commit message (author, date; files changed, lines -removed/+added)
...
* powerpc: Delete unneeded ELF_MACHINE_BEFORE_RTLD_RELOC (Fangrui Song, 2021-09-27; 1 file, -2/+0)
  Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* Add narrowing fma functions (Joseph Myers, 2021-09-22; 2 files, -0/+8)
  This patch adds the narrowing fused multiply-add functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: ffma, ffmal, dfmal, f32fmaf64, f32fmaf32x, f32xfmaf64 for all configurations; f32fmaf64x, f32fmaf128, f64fmaf64x, f64fmaf128, f32xfmaf64x, f32xfmaf128, f64xfmaf128 for configurations with _Float64x and _Float128; __f32fmaieee128 and __f64fmaieee128 aliases in the powerpc64le case (for calls to ffmal and dfmal when long double is IEEE binary128). Corresponding tgmath.h macro support is also added.

  The changes are mostly similar to those for the other narrowing functions previously added, especially that for sqrt, so the description of those generally applies to this patch as well. As with sqrt, I reused the same test inputs in auto-libm-test-in as for non-narrowing fma rather than adding extra or separate inputs for narrowing fma. The tests in libm-test-narrow-fma.inc also follow those for non-narrowing fma.

  The non-narrowing fma has a known bug (bug 6801) that it does not set errno on errors (overflow, underflow, Inf * 0, Inf - Inf). Rather than fixing this or having narrowing fma check for errors when non-narrowing does not (complicating the cases when narrowing fma can otherwise be an alias for a non-narrowing function), this patch does not attempt to check for errors from narrowing fma and set errno; the CHECK_NARROW_FMA macro is still present, but as a placeholder that does nothing, and this missing errno setting is considered to be covered by the existing bug rather than needing a separate open bug. missing-errno annotations are duly added to many of the auto-libm-test-in test inputs for fma.

  This completes adding all the new functions from TS 18661-1 to glibc, so will be followed by corresponding stdc-predef.h changes to define __STDC_IEC_60559_BFP__ and __STDC_IEC_60559_COMPLEX__, as the support for TS 18661-1 will be at a similar level to that for C standard floating-point facilities up to C11 (pragmas not implemented, but library functions done). (There are still further changes to be done to implement changes to the types of fromfp functions from N2548.)

  Tested as follows: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* powerpc: Fix unrecognized instruction errors with recent GCC (Paul A. Clarke, 2021-09-20; 1 file, -0/+1)
  Recent binutils commit b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a changes the behavior of `.machine` directives to override, rather than augment, the base CPU. This can result in _reduced_ functionality when, for example, compiling for default machine "power8", but explicitly asking for ".machine power5", which loses Altivec instructions.

  In tst-ucontext-ppc64-vscr.c, while the instructions provoking the new error messages are bracketed by ".machine power5", which is ostensibly Power ISA 2.03 (POWER5), the POWER5 processor did not support the Altivec (VMX) subset, so these instructions are not recognized as "power5".

    Error: unrecognized opcode: `vspltisb'
    Error: unrecognized opcode: `vpkuwus'
    Error: unrecognized opcode: `mfvscr'
    Error: unrecognized opcode: `stvx'

  Manually adding the Altivec subset via ".machine altivec" is sufficient.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* Add narrowing square root functions (Joseph Myers, 2021-09-10; 2 files, -0/+8)
  This patch adds the narrowing square root functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: fsqrt, fsqrtl, dsqrtl, f32sqrtf64, f32sqrtf32x, f32xsqrtf64 for all configurations; f32sqrtf64x, f32sqrtf128, f64sqrtf64x, f64sqrtf128, f32xsqrtf64x, f32xsqrtf128, f64xsqrtf128 for configurations with _Float64x and _Float128; __f32sqrtieee128 and __f64sqrtieee128 aliases in the powerpc64le case (for calls to fsqrtl and dsqrtl when long double is IEEE binary128). Corresponding tgmath.h macro support is also added.

  The changes are mostly similar to those for the other narrowing functions previously added, so the description of those generally applies to this patch as well. However, the not-actually-narrowing cases (where the two types involved in the function have the same floating-point format) are aliased to sqrt, sqrtl or sqrtf128 rather than needing a separately built not-actually-narrowing function such as was needed for add / sub / mul / div. Thus, there is no __nldbl_dsqrtl name for ldbl-opt because no such name was needed (whereas the other functions needed such a name since the only other name for that entry point was e.g. f32xaddf64, not reserved by TS 18661-1); the headers are made to arrange for sqrt to be called in that case instead.

  The DIAG_* calls in sysdeps/ieee754/soft-fp/s_dsqrtl.c are because they were observed to be needed in GCC 7 testing of riscv32-linux-gnu-rv32imac-ilp32. The other sysdeps/ieee754/soft-fp/ files added didn't need such DIAG_* in any configuration I tested with build-many-glibcs.py, but if they do turn out to be needed in more files with some other configuration / GCC version, they can always be added there.

  I reused the same test inputs in auto-libm-test-in as for non-narrowing sqrt rather than adding extra or separate inputs for narrowing sqrt. The tests in libm-test-narrow-sqrt.inc also follow those for non-narrowing sqrt.

  Tested as follows: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* Remove "Contributed by" lines (Siddhesh Poyarekar, 2021-09-03; 15 files, -15/+0)
  We stopped adding "Contributed by" or similar lines in sources in 2012 in favour of git logs and keeping the Contributors section of the glibc manual up to date. Removing these lines makes the license header a bit more consistent across files and also removes the possibility of error in attribution when license blocks or files are copied across since the contributed-by lines don't actually reflect reality in those cases.

  Move all "Contributed by" and similar lines (Written by, Test by, etc.) into a new file CONTRIBUTED-BY to retain record of these contributions. These contributors are also mentioned in manual/contrib.texi, so we just maintain this additional record as a courtesy to the earlier developers.

  The following scripts were used to filter a list of files to edit in place and to clean up the CONTRIBUTED-BY file respectively. These were not added to the glibc sources because they're not expected to be of any use in future given that this is a one time task:

    https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc
    https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02

  Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Remove sysdeps/*/tls-macros.h (Fangrui Song, 2021-08-18; 1 file, -42/+0)
  They provide TLS_GD/TLS_LD/TLS_IE/TLS_LE macros for TLS testing. Now that we have migrated to __thread and tls_model attributes, these macros are unused and the tls-macros.h files can retire.

  Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* powerpc64: Add checks for Altivec and VSX in ifunc selection (Anton Blanchard, 2021-08-06; 26 files, -68/+139)
  We'd like to support processors without Altivec or VSX, so check the relevant hwcap bits before selecting them.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64: Check cacheline size before using optimised memset routines (Anton Blanchard, 2021-08-06; 2 files, -10/+23)
  A number of optimised memset routines assume the cacheline size is 128B, so we had better check before using them.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
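A minimal sketch of the kind of guard this commit describes, not the actual glibc selector: the function names `memset_assumes_128b`, `memset_generic` and `select_memset` are hypothetical, and both stand-in routines simply forward to `memset`. The only point shown is that the routine assuming 128-byte cache lines is chosen only when the reported size matches.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: one routine that assumes 128-byte cache
   lines and one generic fallback.  Both just forward to memset here.  */
typedef void *(*memset_fn) (void *, int, size_t);

static void *
memset_assumes_128b (void *s, int c, size_t n)
{
  return memset (s, c, n);
}

static void *
memset_generic (void *s, int c, size_t n)
{
  return memset (s, c, n);
}

/* Pick the optimised routine only when the reported cache line size
   matches the assumption baked into it; any other value (including 0
   for "unknown") falls back to the generic routine.  */
static memset_fn
select_memset (unsigned int cacheline_size)
{
  return cacheline_size == 128 ? memset_assumes_128b : memset_generic;
}
```

Treating an unknown size as "do not use the optimised routine" is the conservative choice in this sketch.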
* powerpc64: Replace some PPC_FEATURE_HAS_VSX with PPC_FEATURE_ARCH_2_06 (Anton Blanchard, 2021-08-06; 20 files, -38/+38)
  We use PPC_FEATURE_HAS_VSX to select a number of POWER7 optimised functions. These functions don't use any VSX instructions, so PPC_FEATURE_ARCH_2_06 seems like a better fit.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64le: Fix typo in configure (Anton Blanchard, 2021-07-08; 2 files, -2/+2)
  The configure script checks for -mlong-double-128 but mentions -mlongdouble when it fails.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc64: Remove strcspn ifunc from the loader (Tulio Magno Quites Machado Filho, 2021-07-08; 1 file, -0/+18)
  5 years ago, commit 8f1b841e452dbb083112fd036033b7f4af506ba0 unintentionally added an ifunc to the loader. That modification has not caused any harm so far, but it doesn't add any value either, because the hwcap information is available later during libc initialization.

  Suggested-by: Anton Blanchard <anton@ozlabs.org>
* powerpc: optimize strcpy/stpcpy for POWER9/10 (Pedro Franco de Carvalho, 2021-07-01; 1 file, -71/+89)
  This patch modifies the current POWER9 implementation of strcpy and stpcpy to optimize it for POWER9/10.

  Since no new POWER10 instructions are used, the original POWER9 strcpy is modified instead of creating a new implementation for POWER10. This implementation is based on both the original POWER9 implementation of strcpy and the preamble of the new POWER10 implementation of strlen.

  The changes also affect stpcpy, which uses the same implementation with some additional code before returning.

  On POWER9, averaging improvements across the benchmark inputs (length/source alignment/destination alignment), for an experiment that ran the benchmark five times, bench-strcpy showed an improvement of 5.23%, and bench-stpcpy showed an improvement of 6.59%. On POWER10, bench-strcpy showed 13.16%, and bench-stpcpy showed 13.59%.

  The changes are:

  1. Removed the null string optimization. Although this results in a few extra cycles for the null string, in combination with the second change, this resulted in improvements for other cases.
  2. Adapted the preamble from strlen for POWER10. This is the part of the function that handles up to the first 16 bytes of the string.
  3. Increased number of unrolled iterations in the main loop to 6.

  Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
  Tested-by: Matheus Castanho <msc@linux.ibm.com>
* Add build option to disable usage of scv on powerpc (Matheus Castanho, 2021-06-10; 1 file, -8/+8)
  Commit 68ab82f56690ada86ac1e0c46bad06ba189a10ef added support for the scv syscall ABI on powerpc. Since then systems that have kernel and processor support started using scv. However adding the proper support for a new syscall ABI requires changes to several other projects (e.g. qemu, valgrind, strace, kernel), which are gradually receiving support.

  Meanwhile, having a way to disable scv on glibc at build time can be useful for distros that may encounter conflicts with projects that still do not support the scv ABI, buying time until proper support is added.

  This commit adds a --disable-scv option that disables scv support and uses sc for all syscalls, like before commit 68ab82f56690ada86ac1e0c46bad06ba189a10ef.

  Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* Remove stale references to libdl.a (Florian Weimer, 2021-06-09; 1 file, -1/+0)
  Since commit 0c1c3a771eceec46e66ce1183cf988e2303bd373 ("dlfcn: Move dlopen into libc") libdl.a is empty, so linking against it is no longer necessary.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: Optimized memcmp for power10 (Lucas A. M. Magalhaes, 2021-05-31; 5 files, -1/+218)
  This patch was based on the __memcmp_power8 and the recent __strlen_power10.

  Improvements from __memcmp_power8:

  1. Don't need alignment code. On POWER10 lxvp and lxvl do not generate alignment interrupts, so they are safe for use on caching-inhibited memory. Note that the comparison in the main loop waits for both VSRs to be ready, so aligning only one of the input addresses does not improve performance. Aligning both registers would require a vperm, which adds too much overhead.
  2. Uses new POWER10 instructions. This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. The vextractbm is used to have a smaller tail code for calculating the return value.
  3. Performance improvement. This version has around 35% better performance on average. I saw no performance regressions for any length or alignment.

  Thanks Matheus for helping me out with some details.

  Co-authored-by: Matheus Castanho <msc@linux.ibm.com>
  Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* powerpc: Fix handling of scv return error codes [BZ #27892] (Nicholas Piggin, 2021-05-24; 1 file, -2/+3)
  When using scv for templated ASM syscalls, current code interprets any negative return value as error, but the only valid error codes are in the range -4095..-1 according to the ABI.

  This commit also fixes 'signal.gen.test' strace test, where the issue was first identified.

  Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
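The range check described above can be sketched in portable C. This is only an illustration of the rule (error if and only if the value lies in -4095..-1), not the assembly macro the commit changes; the names `MAX_ERRNO` and `scv_result_is_error` are chosen here for the example.

```c
#include <stdbool.h>

/* Largest error number that can be encoded in a return value,
   per the -4095..-1 range quoted in the commit message.  */
#define MAX_ERRNO 4095L

/* True only for values in -4095..-1.  Other negative values are
   ordinary results and must not be treated as errors.  */
static bool
scv_result_is_error (long ret)
{
  return ret >= -MAX_ERRNO && ret <= -1;
}
```

With this check, a large negative value such as -4096 is passed through as a result, whereas a plain `ret < 0` test would have misreported it as an error.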
* powerpc64le: Check HWCAP bits against compiler build flags (Florian Weimer, 2021-05-19; 1 file, -0/+52)
  When built with GCC 11.1 and -mcpu=power9, ld.so prints this error message when running on POWER8:

    Fatal glibc error: CPU lacks ISA 3.00 support (POWER9 or later required)
* powerpc: Add optimized rawmemchr for POWER10 (Matheus Castanho, 2021-05-17; 6 files, -27/+188)
  Reuse code for optimized strlen to implement a faster version of rawmemchr. This takes advantage of the same benefits provided by the strlen implementation, but needs some extra steps. __strlen_power10 code should be unchanged after this change.

  rawmemchr returns a pointer to the char found, while strlen returns only the length, so we have to take that into account when preparing the return value.

  To quickly check 64B, the loop on __strlen_power10 merges the whole block into 16B by using unsigned minimum vector operations (vminub) and checks if there are any \0 on the resulting vector. The same code is used by rawmemchr if the char c is 0. However, this approach does not work when c != 0. We first need to subtract each byte by c, so that the value we are looking for is converted to a 0, then taking the minimum and checking for nulls works again.

  The new code branches after it has compared ~256 bytes and chooses which of the two strategies above will be used in the main loop, based on the char c. This extra branch adds some overhead (~5%) for length ~256, but is quickly amortized by the faster loop for larger sizes.

  Compared to __rawmemchr_power9, this version is ~20% faster for length < 256. Because of the optimized main loop, the improvement becomes ~35% for c != 0 and ~50% for c = 0 for strings longer than 256.

  Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
  Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
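The subtract-then-minimum idea can be illustrated with a scalar emulation. This is a sketch in plain C, not the vector assembly: the function name `block64_contains` is hypothetical, and the inner loop plays the role that vminub plays across four 16-byte vectors.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if any of the 64 bytes at BLOCK equals C.
   Each byte has C subtracted (modulo 256) so that a match becomes 0.
   The four 16-byte chunks are then folded into one by an unsigned
   byte-wise minimum, and the folded result is scanned for a 0.  */
static bool
block64_contains (const unsigned char *block, unsigned char c)
{
  for (size_t lane = 0; lane < 16; lane++)
    {
      uint8_t min = 0xff;
      for (size_t chunk = 0; chunk < 4; chunk++)
        {
          uint8_t v = (uint8_t) (block[chunk * 16 + lane] - c);
          if (v < min)
            min = v;
        }
      if (min == 0)
        return true;
    }
  return false;
}
```

When c is 0 the subtraction is a no-op, which matches the observation that the strlen loop can be reused unchanged in that case.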
* powerpc64le: Fix ifunc selection for memset, memmove, bzero and bcopy (Raoni Fassina Firmino, 2021-05-07; 5 files, -20/+22)
  The hwcap2 check for the aforementioned functions should check for both PPC_FEATURE2_ARCH_3_1 and PPC_FEATURE2_HAS_ISEL but was mistakenly checking for any one of them, enabling isa 3.1 version of the functions in incompatible processors, like POWER8.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
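The difference between "any one of" and "both" comes down to how the masked value is tested. The sketch below uses placeholder bit values (`DEMO_FEATURE2_*`), not the real PPC_FEATURE2_* constants, and hypothetical helper names; it shows only the shape of the two tests.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only; the real
   PPC_FEATURE2_* constants come from the system headers.  */
#define DEMO_FEATURE2_ARCH_3_1 0x1UL
#define DEMO_FEATURE2_HAS_ISEL 0x2UL

/* Mistaken form: true if either bit is set.  */
static bool
has_any (unsigned long hwcap2)
{
  return (hwcap2 & (DEMO_FEATURE2_ARCH_3_1 | DEMO_FEATURE2_HAS_ISEL)) != 0;
}

/* Intended form: true only if both bits are set.  */
static bool
has_both (unsigned long hwcap2)
{
  unsigned long need = DEMO_FEATURE2_ARCH_3_1 | DEMO_FEATURE2_HAS_ISEL;
  return (hwcap2 & need) == need;
}
```

A processor that reports only the ISEL bit passes `has_any` but fails `has_both`, which is the situation the commit message describes for POWER8.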
* powerpc64le: Optimize memset for POWER10 (Raoni Fassina Firmino, 2021-04-30; 6 files, -1/+314)
  This implementation is based on __memset_power8 and integrates a lot of suggestions from Anton Blanchard.

  The biggest difference is that it makes extensive use of stxvl in the alignment and tail code to avoid branches and small stores. It has three main execution paths:

  a) "Short lengths" for lengths up to 64 bytes, avoiding as many branches as possible.
  b) "General case" for larger lengths. It has an alignment section using stxvl to avoid branches, a 128-byte loop and then a tail code, again using stxvl with few branches.
  c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set value being zero. It is mostly the __memset_power8 code, but the alignment phase was simplified because, at this point, the address is already 16-byte aligned, and it was also changed to use vector stores. The tail code was also simplified to reuse the general case tail.

  All unaligned stores use stxvl instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory.

  On average, this implementation provides something around 30% improvement when compared to __memset_power8.

  Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
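The three execution paths can be summarised as a small dispatch function. This is a sketch built only from the thresholds stated in the commit message (64 bytes, 256 bytes, value zero); the enum and function names are hypothetical, and the real assembly may order or combine these tests differently.

```c
#include <stddef.h>

/* Hypothetical labels for the three paths named in the message.  */
enum memset_path
{
  PATH_SHORT,              /* a) lengths up to 64 bytes */
  PATH_GENERAL,            /* b) larger lengths */
  PATH_ZERO_CACHE_BLOCKS   /* c) 256 bytes and up, value zero */
};

/* Choose a path from the fill value and length, using the thresholds
   quoted in the commit message.  memset uses VALUE converted to
   unsigned char, so the zero test is done on that converted value.  */
static enum memset_path
choose_memset_path (int value, size_t len)
{
  if (len <= 64)
    return PATH_SHORT;
  if (len >= 256 && (unsigned char) value == 0)
    return PATH_ZERO_CACHE_BLOCKS;
  return PATH_GENERAL;
}
```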
* powerpc64le: Optimize memcpy for POWER10 (Tulio Magno Quites Machado Filho, 2021-04-30; 5 files, -1/+238)
  This implementation is based on __memcpy_power8_cached and integrates suggestions from Anton Blanchard. It benefits from loads and stores with length for short lengths and for tail code, simplifying the code.

  All unaligned memory accesses use instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory.

  The main loop has also been modified in order to increase instruction throughput by reducing the dependency on updates from previous iterations.

  On average, this implementation provides around 30% improvement when compared to __memcpy_power7 and 10% improvement in comparison to __memcpy_power8_cached.
* powerpc64le: Optimized memmove for POWER10 (Lucas A. M. Magalhaes, 2021-04-30; 8 files, -7/+388)
  This patch was initially based on the __memmove_power7 with some ideas from strncpy implementation for Power 9.

  Improvements from __memmove_power7:

  1. Use lxvl/stxvl for alignment code. The code for Power 7 uses branches when the input is not naturally aligned to the width of a vector. The new implementation uses lxvl/stxvl instead, which reduces pressure on GPRs. It also allows the removal of branch instructions, implicitly removing branch stalls and mispredictions.
  2. The lxv/stxv and lxvl/stxvl pairs are safe to use on cache-inhibited (CI) memory. On Power 10, vector loads and stores are safe to use on CI memory for addresses unaligned to 16B. This code takes advantage of this to do unaligned loads. The unaligned loads don't have a significant performance impact by themselves. However, doing so decreases register pressure on GPRs and interdependence stalls on load/store pairs. This also improved readability, as there are now fewer code paths for different alignments. Finally, this reduces the overall code size.
  3. Improved performance. This version runs on average about 30% better than memmove_power7 for lengths larger than 8KB. For input lengths shorter than 8KB the improvement is smaller: it has on average about 17% better performance. This version has a degradation of about 50% for input lengths in the 0 to 31 bytes range when dest is unaligned.

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add log IFUNC multiarch support for POWER10 (Raphael Moreira Zinsly, 2021-04-26; 7 files, -0/+101)
  Checked on ppc64le built without --with-cpu, with --with-cpu=power9 and with --disable-multi-arch.

  Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
* powerpc: Add optimized strlen for POWER10 (Matheus Castanho, 2021-04-22; 5 files, -1/+230)
  Improvements compared to POWER9 version:

  1. Take into account first 16B comparison for aligned strings. The previous version compares the first 16B and increments r4 by the number of bytes until the address is 16B-aligned, then starts doing aligned loads at that address. For aligned strings, this causes the first 16B to be compared twice, because the increment is 0. Here we calculate the next 16B-aligned address differently, which avoids that issue.
  2. Use simple comparisons for the first ~192 bytes. The main loop is good for big strings, but comparing 16B each time is better for smaller strings. So after aligning the address to 16 bytes, we check 176B more in 16B chunks. There may be some overlaps with the main loop for unaligned strings, but we avoid using the more aggressive strategy too soon, and also allow the loop to start at a 64B-aligned address. This greatly benefits smaller strings and avoids overlapping checks if the string is already aligned at a 64B boundary.
  3. Reduce dependencies between load blocks caused by address calculation on loop. Doing a precise time tracing on the code showed many loads in the loop were stalled waiting for updates to r4 from previous code blocks. This implementation avoids that as much as possible by using 2 registers (r4 and r5) to hold addresses to be used by different parts of the code. Also, the previous code aligned the address to 16B, then to 64B by doing a few 48B loops (if needed) until the address was aligned. The main loop could not start until that 48B loop had finished and r4 was updated with the current address. Here we calculate the address used by the loop very early, so it can start sooner. The main loop now uses 2 pointers 128B apart to make pointer updates less frequent, and also unrolls 1 iteration to guarantee there is enough time between iterations to update the pointers, reducing stalled cycles.
  4. Use new P10 instructions. lxvp is used to load 32B with a single instruction, reducing contention in the load queue. vextractbm allows simplifying the tail code for the loop, replacing vbpermq and avoiding having to generate a permute control vector.

  Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com>
  Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
  Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
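Point 1 above is an address-arithmetic issue that a short sketch can make concrete. The commit message does not spell out either computation, so both functions below are assumptions: `next_aligned_old` models "advance by the number of bytes until aligned" and `next_aligned_new` is one possible way to get the behaviour described (never landing back on the address already checked).

```c
#include <stdint.h>

/* Assumed shape of the older computation: advance by the distance to
   the next 16-byte boundary.  For an already aligned address that
   distance is 0, so the same 16 bytes get checked again.  */
static uintptr_t
next_aligned_old (uintptr_t addr)
{
  return addr + ((16 - (addr & 15)) & 15);
}

/* One way to avoid the repeat (an assumption, not taken from the
   source): round down to a 16-byte boundary, then step a full
   16 bytes forward.  */
static uintptr_t
next_aligned_new (uintptr_t addr)
{
  return (addr & ~(uintptr_t) 15) + 16;
}
```

For an unaligned address both variants give the same result; they differ only when the address is already 16B-aligned, which is exactly the case the commit message calls out.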
* powerpc64le: Use ifunc for _Float128 functions also in libc (Andreas Schwab, 2021-04-01; 3 files, -8/+17)
  This fixes missing definition of math functions in libc in a static link that are no longer built for libm after commit 4898d9712b ("Avoid adding duplicated symbols into static libraries").
* powerpc: Add optimized llogb* for POWER9 (Raphael Moreira Zinsly, 2021-03-16; 2 files, -0/+43)
  The POWER9 builtins used to improve the ilogb* functions can be used in the llogb* functions as well.
* powerpc: Add optimized ilogb* for POWER9 (Raphael Moreira Zinsly, 2021-03-16; 2 files, -0/+34)
  The instructions xsxexpdp and xsxexpqp introduced on POWER9 extract the exponent from a double-precision and a quad-precision floating-point value respectively, thus they can be used to improve ilogb, ilogbf and ilogbf128.
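To show why an exponent-extraction instruction helps ilogb, here is a portable sketch of the same operation done by hand on a double. It assumes `double` is IEEE 754 binary64; the function names are hypothetical, and the unbiasing shortcut is valid only for normal, finite, non-zero inputs (the real functions also have to handle zero, subnormals, infinities and NaN).

```c
#include <stdint.h>
#include <string.h>

/* Biased exponent field of an IEEE 754 binary64 value: the 11 bits
   just below the sign bit.  memcpy is used to read the bit pattern
   without violating aliasing rules.  */
static unsigned int
biased_exponent (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);
  return (unsigned int) ((bits >> 52) & 0x7ff);
}

/* Unbiased exponent.  Valid only for normal, finite, non-zero
   inputs; special values need separate handling.  */
static int
ilogb_normal (double x)
{
  return (int) biased_exponent (x) - 1023;
}
```

A single instruction that returns the exponent field replaces the shift and mask in `biased_exponent`.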
* Reduce the statically linked startup code [BZ #23323] (Florian Weimer, 2021-02-25; 1 file, -2/+2)
  It turns out the startup code in csu/elf-init.c has a perfect pair of ROP gadgets (see Marco-Gisbert and Ripoll-Ripoll, "return-to-csu: A New Method to Bypass 64-bit Linux ASLR"). These functions are not needed in dynamically-linked binaries because DT_INIT/DT_INIT_ARRAY are already processed by the dynamic linker. However, the dynamic linker skipped the main program for some reason. For maximum backwards compatibility, this is not changed, and instead, the main map is consulted from __libc_start_main if the init function argument is a NULL pointer.

  For statically linked binaries, the old approach based on linker symbols is still used because there is nothing else available.

  A new symbol version __libc_start_main@@GLIBC_2.34 is introduced because new binaries running on an old libc would not run their ELF constructors, leading to difficult-to-debug issues.
* powerpc64: Workaround sigtramp vdso return call (Raoni Fassina Firmino, 2021-01-28; 1 file, -1/+12)
  A not so recent kernel change[1] changed how the trampoline `__kernel_sigtramp_rt64` is used to call signal handlers. This was exposed on the test misc/tst-sigcontext-get_pc.

  Before kernel 5.9, the kernel set LR to the trampoline address and jumped directly to the signal handler, and at the end, the signal handler, as any other function, would `blr` to the address set. In other words, the trampoline was executed just at the end of the signal handler and the only thing it did was call sigreturn.

  Since kernel 5.9, the kernel sets CTR to the signal handler and calls the trampoline code. The trampoline then does `bctrl` to the address in CTR, which sets LR to the next instruction in the middle of the trampoline; when the signal handler returns, the rest of the trampoline executes the same code as before.

  Here is the full trampoline code as of kernel 5.11.0-rc5 for reference:

    V_FUNCTION_BEGIN(__kernel_sigtramp_rt64)
    .Lsigrt_start:
            bctrl   /* call the handler */
            addi    r1, r1, __SIGNAL_FRAMESIZE
            li      r0,__NR_rt_sigreturn
            sc
    .Lsigrt_end:
    V_FUNCTION_END(__kernel_sigtramp_rt64)

  This new behavior breaks the way `backtrace()` detects the trampoline frame to correctly reconstruct the stack frame when it is called from inside a signal handler.

  This workaround relies on the fact that the trampoline code is at the very least two (maybe 3?) instructions in size (as in the 32-bit version, which has only `li` and `sc`), so it is safe to check that the return address is in the range __kernel_sigtramp_rt64 .. + 4.

  [1] subject: powerpc/64/signal: Balance return predictor stack in signal trampoline
      commit: 0138ba5783ae0dcc799ad401a1e8ac8333790df9
      url: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0138ba5783ae0dcc799ad401a1e8ac8333790df9

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
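The range test at the heart of the workaround is small enough to sketch directly. This is an illustration under the assumption that the range "__kernel_sigtramp_rt64 .. + 4" is inclusive at both ends; the function name and parameters are hypothetical, and how the trampoline address is obtained is outside the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if PC lies within the first two instruction slots of
   the trampoline that starts at TRAMP_START.  Offset 0 covers the
   older kernels (LR pointed at the start), offset 4 covers kernels
   where LR points just past the leading bctrl.  */
static bool
is_sigtramp_return (uintptr_t pc, uintptr_t tramp_start)
{
  return pc >= tramp_start && pc <= tramp_start + 4;
}
```

Accepting both offsets lets one check work regardless of which kernel behaviour is in effect.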
* powerpc64: Select POWER9 machine for the scv instruction (Florian Weimer, 2021-01-22; 1 file, -0/+3)
  It is not available with the baseline ISA.

  Fixes commit 68ab82f56690ada86ac1e0c46bad06ba189a10ef ("powerpc: Runtime selection between sc and scv for syscalls").

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* Update copyright dates with scripts/update-copyrights (Paul Eggert, 2021-01-02; 265 files, -265/+265)
  I used these shell commands:

    ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
    (cd ../glibc && git commit -am"[this commit message]")

  and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO.

  I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah:

    remote: *** pre-commit check failed ...
    remote: *** error: lines with trailing whitespace found
    remote: error: hook declined to update refs/heads/master
* powerpc: Runtime selection between sc and scv for syscalls (Matheus Castanho, 2020-12-30; 1 file, -6/+114)
  Linux kernel v5.9 added support for system calls using the scv instruction for POWER9 and later. The new codepath provides better performance (see below) if compared to using sc. For the foreseeable future, both sc and scv mechanisms will co-exist, so this patch enables glibc to do a runtime check and use scv when it is available.

  Before issuing the system call to the kernel, we check hwcap2 in the TCB for PPC_FEATURE2_SCV to see if scv is supported by the kernel. If not, we fall back to sc and keep the old behavior.

  The kernel implements a different error return convention for scv, so when returning from a system call we need to handle the return value differently depending on the instruction we used to enter the kernel.

  For syscalls implemented in ASM, entry and exit are implemented by different macros (PSEUDO and PSEUDO_RET, resp.), which may be used in sequence (e.g. for templated syscalls) or with other instructions in between (e.g. clone). To avoid accessing the TCB a second time on PSEUDO_RET to check which instruction we used, the value read from hwcap2 is cached on a non-volatile register.

  This is not needed when using INTERNAL_SYSCALL macro, since entry and exit are bundled into the same inline asm directive.

  The dynamic loader may issue syscalls before the TCB has been set up, so it always uses sc with no extra checks. For the static case, there is no compile-time way to determine if we are inside startup code, so we also check the value of the thread pointer before effectively accessing the TCB. For such situations in which the availability of scv cannot be determined, sc is always used.

  Support for scv in syscalls implemented in their own ASM file (clone and vfork) will be added later. For now simply use sc as before.

  Average performance over 1M calls for each syscall "type":

    - stat: C wrapper calling INTERNAL_SYSCALL
    - getpid: templated ASM syscall
    - syscall: call to gettid using syscall function

  Standard:
    stat    : 1.573445 us / ~3619 cycles
    getpid  : 0.164986 us / ~379 cycles
    syscall : 0.162743 us / ~374 cycles

  With scv:
    stat    : 1.537049 us / ~3535 cycles <~ -84 cycles / -2.32%
    getpid  : 0.109923 us / ~253 cycles <~ -126 cycles / -33.25%
    syscall : 0.116410 us / ~268 cycles <~ -106 cycles / -28.34%

  Tested on powerpc, powerpc64, powerpc64le (with and without scv)

  Tested-by: Lucas A. M. Magalhães <lamm@linux.ibm.com>
  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
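The selection rule described in this commit (use scv only when the thread pointer is valid and the feature bit is set, otherwise sc) can be sketched as a plain C predicate. Everything named here is hypothetical: `struct demo_tcb` stands in for the real TCB, `DEMO_FEATURE2_SCV` is a placeholder bit value rather than the real PPC_FEATURE2_SCV, and the real code does this check in assembly macros.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit value for illustration only.  */
#define DEMO_FEATURE2_SCV 0x1UL

/* Hypothetical stand-in for the part of the TCB that caches hwcap2.  */
struct demo_tcb
{
  unsigned long hwcap2;
};

/* Decide whether scv may be used.  A null TCB models the startup
   window in which the thread pointer is not set up yet; in that case
   availability cannot be determined and sc is used.  */
static bool
use_scv (const struct demo_tcb *tcb)
{
  if (tcb == NULL)
    return false;
  return (tcb->hwcap2 & DEMO_FEATURE2_SCV) != 0;
}
```

Because the two instructions use different error return conventions, a caller would also need to remember the answer until the return value is interpreted, which is why the message mentions caching the hwcap2 value in a non-volatile register.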
* powerpc64le: Add glibc-hwcaps support (Florian Weimer, 2020-12-04; 3 files, -0/+128)
  The "power10" and "power9" subdirectories are selected in a way that matches the -mcpu=power10 and -mcpu=power9 options of GCC.
* powerpc64le: ifunc select *f128 routines in multiarch mode (Paul E. Murphy, 2020-11-30; 16 files, -197/+817)
  Programmatically generate simple wrappers for interesting libm *f128 objects. Selected functions are transcendental functions or those with trivial compiler builtins. This can result in a 2-3x speedup (e.g. logf128 and expf128).

  A second set of implementation files are generated which include the first implementation encountered along the search path. This usually works, except when a wrapper is overridden and makefile search order slightly diverges from include order. Likewise, wrapper object files are created for each generated file. These hold the ifunc selection routines which export ABI.

  Next, several shared headers are intercepted to control renaming: asm function redirects are used first, and macro renames where the former is impractical.

  Notably, if the requested machine supports hardware IEEE128 (i.e. POWER9 and newer) this ifunc machinery is disabled. Likewise, existing ifunc support for float128 is consolidated into this (e.g. sqrtf128 and fmaf128).

  Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Eliminate UP macro conditionals (Florian Weimer, 2020-11-13; 1 file, -3/+1)
  The macro is never defined.

  Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: Add optimized stpncpy for POWER9Raphael M Zinsly2020-11-126-2/+135
| | | | | | | Add stpncpy support into the POWER9 strncpy. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add optimized strncpy for POWER9Raphael M Zinsly2020-11-125-1/+391
| | | | | | | | Similar to the strcpy P9 optimization, this version uses VSX to improve performance. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: fix ifunc implementation list for POWER9 strlen and stpcpyRaphael Moreira Zinsly2020-09-171-2/+2
| | | | | __strlen_power9 and __stpcpy_power9 were added to their ifunc lists using the wrong function names.
* powerpc64le: guarantee a .gnu.attributes section [BZ #26220]Paul E. Murphy2020-07-211-0/+8
| | | | | | | | | | Upstream GCC 11 development is now building the ibm128 runtime support (in libgcc) without a .gnu.attributes section on ppc64le. Ensure we have one to replace by building one ibm128 file in libc and libm with attributes. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64: Fix calls when r2 is not used [BZ #26173]Tulio Magno Quites Machado Filho2020-07-105-3/+48
| | | | | | | | | Teach the linker that __mcount_internal, __sigjmp_save_symbol, __syscall_error and __GI_exit do not use r2, so that it does not need to recover r2 after the call. Test at configure time if the assembler supports @notoc and define USE_PPC64_NOTOC.
* powerpc: Add support for POWER10Tulio Magno Quites Machado Filho2020-06-298-0/+10
| | | | | | | | 1. Add the directories to hold POWER10 files. 2. Add support to select POWER10 libraries based on AT_PLATFORM. 3. Let submachine=power10 be set automatically.
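Selecting a library directory from the platform string can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the helper `platform_subdir` and its list of names are hypothetical, and glibc's actual selection logic is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* Maps a platform string to a library subdirectory name.
   Returns NULL when the platform is unknown or no string is given.
   The list of names is an assumption made for this sketch. */
const char *platform_subdir(const char *platform)
{
    static const char *const known[] = { "power10", "power9", "power8" };

    if (platform == NULL)
        return NULL;

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(platform, known[i]) == 0)
            return known[i];
    }
    return NULL;
}
```

On Linux, the platform string can be read with `getauxval(AT_PLATFORM)` from `<sys/auxv.h>`, whose return value is cast to `const char *` before being passed to a helper like this one.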
* powerpc64le: refactor e_sqrtf128.cPaul E. Murphy2020-06-162-39/+7
| | | | | Combine both implementations into a single file to allow building twice with appropriate multiarch support when possible.
* powerpc64le: add optimized strlen for P9Paul E. Murphy2020-06-056-1/+226
| | | | | | | | | | | | | | | | This started as a trivial change to Anton's rawmemchr. I got carried away. This is a hybrid of P8's asymptotically faster 64B checks and extremely efficient small string checks, e.g. <64B (and sometimes a little bit more depending on alignment). The second trick is to align to 64B by running a 48B checking loop 16B at a time until we naturally align to 64B (i.e. checking 48/96/144 bytes/iteration based on the alignment after the first 5 comparisons). This alleviates the need to check page boundaries. Finally, explicitly use the P7 strlen with the runtime loader when building P9. We need to be cautious about vector/vsx extensions here on P9-only builds.
* powerpc64le: use common fmaf128 implementationPaul E. Murphy2020-06-052-37/+3
| | | | | | | | | This defines the macro such that it should behave best on all supported powerpc targets. Likewise, this allows us to remove the ppc64le specific s_fmaf128.c. I have verified powerpc64le multiarch and powerpc64le power9 no-multiarch builds continue to generate an optimized fmaf128.
* powerpc: Optimized rawmemchr for POWER9Anton Blanchard2020-05-185-3/+145
| | | | | | | | | | | This version uses vector instructions and is up to 60% faster on medium matches and up to 90% faster on long matches, compared to the POWER7 version. A few examples:

                            __rawmemchr_power9  __rawmemchr_power7
Length   32, alignment 0:   2.27566             3.77765
Length   64, alignment 2:   2.46231             3.51064
Length 1024, alignment 0:   17.3059             32.6678
* powerpc: Optimized stpcpy for POWER9Anton Blanchard via Libc-alpha2020-05-186-21/+123
| | | | | | | | | | Add stpcpy support to the POWER9 strcpy. This is up to 40% faster on small strings and up to 90% faster on long relatively unaligned strings, compared to the POWER8 version. A few examples:

                                        __stpcpy_power9  __stpcpy_power8
Length   20, alignments in bytes 4/ 4:  2.58246          4.8788
Length 1024, alignments in bytes 1/ 6:  24.8186          47.8528
* powerpc: Optimized strcpy for POWER9Anton Blanchard via Libc-alpha2020-05-185-1/+182
| | | | | | | | | | This version uses VSX store vector with length instructions and is significantly faster on small strings and relatively unaligned large strings, compared to the POWER8 version. A few examples:

                                       __strcpy_power9  __strcpy_power8
Length  16, alignments in bytes 0/ 0:  2.52454          4.62695
Length 412, alignments in bytes 4/ 0:  11.6             22.9185
* powerpc64le/power9: guard power9 strcmp against rtld usage [BZ# 25905]Paul E. Murphy2020-05-041-0/+2
| | | | | | | | | | | | | | | | strcmp is used while resolving PLT references. Vector registers should not be used during this. The P9 strcmp makes heavy use of vector registers, so it should be avoided in rtld. This prevents quiet vector register corruption when glibc is configured with --disable-multi-arch and --with-cpu=power9. This can be seen with test-float64x-compat_totalordermag during the first call into totalordermagf64x@GLIBC_2.27. Add a guard to fall back to the power8 implementation when building power9 strcmp for libraries other than libc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc64le: Enable support for IEEE long doubleGabriel F. T. Gomes2020-04-302-0/+5
| | | | | | | | | | | | | | On platforms where long double may have two different formats, i.e. the same format as double (64 bits) or something else (128 bits), building with -mlong-double-128 is the default and function calls in the user program match the name of the function in Glibc. When building with -mlong-double-64, Glibc installed headers redirect such calls to the appropriate function. Likewise, the internals of glibc are now built against IEEE long double. However, the only (minimally) notable usage of long double is difftime. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64le: blacklist broken GCC compilers (e.g GCC 7.5.0)Paul E. Murphy2020-04-302-0/+42
| | | | | | | | | | | GCC 7.5.0 (PR94200) will refuse to compile if both -mabi=% and -mlong-double-128 are passed on the command line. Surprisingly, it will work happily if the latter is not. For the sake of maintaining status quo, test for and blacklist such compilers. Tested with a GCC 8.3.1 and GCC 7.5.0 compiler for ppc64le. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>