about summary refs log tree commit diff
path: root/sysdeps/powerpc
Commit message (Collapse)AuthorAgeFilesLines
...
* nptl: Add <thread_pointer.h> for defining __thread_pointerFlorian Weimer2021-12-091-0/+33
| | | | | | | | | <tls.h> already contains a definition that is quite similar, but it is not consistent across architectures. Only architectures for which rseq support is added are covered. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* csu: Always use __executable_start in gmon-start.cFlorian Weimer2021-12-051-37/+0
| | | | | | | | | | | | | | Current binutils defines __executable_start as the lowest text address, so using the entry point address as a fallback is no longer necessary. As a result, overriding <entry.h> is only necessary if the entry point is not called _start. The previous approach to define __ASSEMBLY__ to suppress the declaration breaks if headers included by <entry.h> are not compatible with __ASSEMBLY__. This happens with rseq integration because it is necessary to include kernel headers in more places. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* nptl: Increase default TCB alignment to 32Florian Weimer2021-12-031-3/+0
| | | | | | | | | | rseq support will use a 32-byte aligned field in struct pthread, so the whole struct needs to have at least that alignment. nptl/tst-tls3mod.c uses TCB_ALIGNMENT, therefore include <descr.h> to obtain the fallback definition. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
* powerpc64[le]: Fix CFI and LR save address for asm syscalls [BZ #28532]Matheus Castanho2021-11-301-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syscalls based on the assembly templates are missing CFI for r31, which gets clobbered when scv is used, and info for LR is inaccurate, placed in the wrong LOC and not using the proper offset. LR was also being saved to the callee's frame, while the ABI mandates it to be saved to the caller's frame. These are fixed by this commit. After this change: $ readelf -wF libc.so.6 | grep 0004b9d4.. -A 7 && objdump --disassemble=kill libc.so.6 00004a48 0000000000000020 00004a4c FDE cie=00000000 pc=000000000004b9d4..000000000004ba3c LOC CFA r31 ra 000000000004b9d4 r1+0 u u 000000000004b9e4 r1+48 u u 000000000004b9e8 r1+48 c-16 u 000000000004b9fc r1+48 c-16 c+16 000000000004ba08 r1+48 c-16 000000000004ba18 r1+48 u 000000000004ba1c r1+0 u libc.so.6: file format elf64-powerpcle Disassembly of section .text: 000000000004b9d4 <kill>: 4b9d4: 1f 00 4c 3c addis r2,r12,31 4b9d8: 2c c3 42 38 addi r2,r2,-15572 4b9dc: 25 00 00 38 li r0,37 4b9e0: d1 ff 21 f8 stdu r1,-48(r1) 4b9e4: 20 00 e1 fb std r31,32(r1) 4b9e8: 98 8f ed eb ld r31,-28776(r13) 4b9ec: 10 00 ff 77 andis. r31,r31,16 4b9f0: 1c 00 82 41 beq 4ba0c <kill+0x38> 4b9f4: a6 02 28 7d mflr r9 4b9f8: 40 00 21 f9 std r9,64(r1) 4b9fc: 01 00 00 44 scv 0 4ba00: 40 00 21 e9 ld r9,64(r1) 4ba04: a6 03 28 7d mtlr r9 4ba08: 08 00 00 48 b 4ba10 <kill+0x3c> 4ba0c: 02 00 00 44 sc 4ba10: 00 00 bf 2e cmpdi cr5,r31,0 4ba14: 20 00 e1 eb ld r31,32(r1) 4ba18: 30 00 21 38 addi r1,r1,48 4ba1c: 18 00 96 41 beq cr5,4ba34 <kill+0x60> 4ba20: 01 f0 20 39 li r9,-4095 4ba24: 40 48 23 7c cmpld r3,r9 4ba28: 20 00 e0 4d bltlr+ 4ba2c: d0 00 63 7c neg r3,r3 4ba30: 08 00 00 48 b 4ba38 <kill+0x64> 4ba34: 20 00 e3 4c bnslr+ 4ba38: c8 32 fe 4b b 2ed00 <__syscall_error> ... 4ba44: 40 20 0c 00 .long 0xc2040 4ba48: 68 00 00 00 .long 0x68 4ba4c: 06 00 5f 5f rlwnm r31,r26,r0,0,3 4ba50: 6b 69 6c 6c xoris r12,r3,26987
* powerpc: Define USE_PPC64_NOTOC iff compiler supports itAdhemerval Zanella2021-11-222-24/+43
| | | | | | | | | | | | | | | | | | The @notoc usage only yields an advantage on ISA 3.1+ machine (power10) and for ld.bfd also when it sees pcrel relocations used on the code (generated if compiler targets ISA 3.1+). On bfd case ISA 3.1+ instruction on stubs are used iff linker also sees the new pc-relative relocations (for instance R_PPC64_D34), otherwise it generates default stubs (ppc64_elf_check_relocs:4700). This patch also help on linkers that do not implement this optimization, since building for older ISA (such as 3.0 / power9) will also trigger power10 stubs generation in the assembly code uses the NOTOC imacro. Checked on powerpc64le-linux-gnu. Reviewed-by: Fangrui Song <maskray@google.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* [powerpc] Tighten contraints for asm constant parametersPaul A. Clarke2021-11-033-9/+9
| | | | | | | | | | | | There are a few places where only known numeric values are acceptable for `asm` parameters, yet the constraint "i" is used. "i" can include "symbolic constants whose values will be known only at assembly time or later." Use "n" instead of "i" where known numeric values are required. Suggested-by: Segher Boessenkool <segher@kernel.crashing.org> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* String: Add hidden defs for __memcmpeq() to enable internal usageNoah Goldstein2021-10-2612-0/+18
| | | | | | | | No bug. This commit adds hidden defs for all declarations of __memcmpeq. This enables usage of __memcmpeq without the PLT for usage internal to GLIBC.
* String: Add support for __memcmpeq() ABI on all targetsNoah Goldstein2021-10-2614-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | No bug. This commit adds support for __memcmpeq() as a new ABI for all targets. In this commit __memcmpeq() is implemented only as an alias to the corresponding targets memcmp() implementation. __memcmpeq() is added as a new symbol starting with GLIBC_2.35 and defined in string.h with comments explaining its behavior. Basic tests that it is callable and works where added in string/tester.c As discussed in the proposal "Add new ABI '__memcmpeq()' to libc" __memcmpeq() is essentially a reserved namespace for bcmp(). The means is shares the same specifications as memcmp() except the return value for non-equal byte sequences is any non-zero value. This is less strict than memcmp()'s return value specification and can be better optimized when a boolean return is all that is needed. __memcmpeq() is meant to only be called by compilers if they can prove that the return value of a memcmp() call is only used for its boolean value. All tests in string/tester.c passed. As well build succeeds on x86_64-linux-gnu target.
* powerpc: Remove backtrace implementationAdhemerval Zanella2021-10-202-250/+0
| | | | | | | | | | | | | | | The powerpc optimization to provide a fast stacktrace requires some ad-hoc code to handle Linux signal frames and the change is fragile once the kernel decides to slight change its execution sequence [1]. The generic implementation work as-is and it should be future proof since the kernel provides the expected CFI directives in vDSO shared page. Checked on powerpc-linux-gnu, powerpc64le-linux-gnu, and powerpc64-linux-gnu. [1] https://sourceware.org/pipermail/libc-alpha/2021-January/122027.html
* elf: Fix dynamic-link.h usage on rtld.cAdhemerval Zanella2021-10-144-21/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on rtld.c due two issues: 1. RTLD_BOOTSTRAP is also used on dl-machine.h on various architectures and it changes the semantics of various machine relocation functions. 2. The elf_get_dynamic_info() change was done sideways, previously to 490e6c62aa get-dynamic-info.h was included by the first dynamic-link.h include *without* RTLD_BOOTSTRAP being defined. It means that the code within elf_get_dynamic_info() that uses RTLD_BOOTSTRAP is in fact unused. To fix 1. this patch now includes dynamic-link.h only once with RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have the relocation fnctions with the expected semantics for the loader. And to fix 2. part of 4af6982e4c is reverted (the check argument elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP pieces are removed. To reorganize the includes the static TLS definition is moved to its own header to avoid a circular dependency (it is defined on dynamic-link.h and dl-machine.h requires it at same time other dynamic-link.h definition requires dl-machine.h defitions). Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL are moved to its own header. Only ancient ABIs need special values (arm, i386, and mips), so a generic one is used as default. The powerpc Elf64_FuncDesc is also moved to its own header, since csu code required its definition (which would require either include elf/ folder or add a full path with elf/). Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32, and powerpc64le. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* elf: Avoid nested functions in the loader [BZ #27220]Fangrui Song2021-10-072-17/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | dynamic-link.h is included more than once in some elf/ files (rtld.c, dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested functions. This harms readability and the nested functions usage is the biggest obstacle prevents Clang build (Clang doesn't support GCC nested functions). The key idea for unnesting is to add extra parameters (struct link_map *and struct r_scope_elm *[]) to RESOLVE_MAP, ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a], elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired by Stan Shebs' ppc64/x86-64 implementation in the google/grte/v5-2.27/master which uses mixed extra parameters and static variables.) Future simplification: * If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM, elf_machine_runtime_setup can drop the `scope` parameter. * If TLSDESC no longer need to be in elf_machine_lazy_rel, elf_machine_lazy_rel can drop the `scope` parameter. Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32, sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64. In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu and riscv64-linux-gnu-rv64imafdc-lp64d. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: update libm test ulpsAdhemerval Zanella2021-10-061-1/+1
| | | | | Update after commit 6bbf7298323bf31bc43494b2201465a449778e10 (Fixed inaccuracy of j0f (BZ #28185)).
* powerpc: Fix unrecognized instruction errors with recent binutilsPaul A. Clarke2021-09-292-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent versions of binutils (with commit b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a) stopped preserving "sticky" options across a base `.machine` directive, nullifying the use of passing "-many" through GCC to the assembler. As a result, some instructions which were recognized even under older, more stringent `.machine` directives become unrecognized instructions in that context. In `sysdeps/powerpc/tst-set_ppr.c`, the use of the `mfppr32` extended mnemonic became unrecognized, as the default compilation with GCC for 32bit powerpc adds a `.machine ppc` in the resulting assembly, so the command line option `-Wa,-many` is essentially ignored, and the ISA 2.06 instructions and mnemonics, like `mfppr32`, are unrecognized. The compilation of `sysdeps/powerpc/tst-set_ppr.c` fails with: Error: unrecognized opcode: `mfppr32' Add appropriate `.machine` directives in the assembly to bracket the `mfppr32` instruction. Part of a 2019 fix (commit 9250e6610fdb0f3a6f238d2813e319a41fb7a810) to the above test's Makefile to add `-many` to the compilation when GCC itself stopped passing `-many` to the assember no longer has any effect, so remove that. Reported-by: Joseph Myers <joseph@codesourcery.com>
* Add fmaximum, fminimum functionsJoseph Myers2021-09-281-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | C2X adds new <math.h> functions for floating-point maximum and minimum, corresponding to the new operations that were added in IEEE 754-2019 because of concerns about the old operations not being associative in the presence of signaling NaNs. fmaximum and fminimum handle NaNs like most <math.h> functions (any NaN argument means the result is a quiet NaN). fmaximum_num and fminimum_num handle both quiet and signaling NaNs the way fmax and fmin handle quiet NaNs (if one argument is a number and the other is a NaN, return the number), but still raise "invalid" for a signaling NaN argument, making them exceptions to the normal rule that a function with a floating-point result raising "invalid" also returns a quiet NaN. fmaximum_mag, fminimum_mag, fmaximum_mag_num and fminimum_mag_num are corresponding functions returning the argument with greatest or least absolute value. All these functions also treat +0 as greater than -0. There are also corresponding <tgmath.h> type-generic macros. Add these functions to glibc. The implementations use type-generic templates based on those for fmax, fmin, fmaxmag and fminmag, and test inputs are based on those for those functions with appropriate adjustments to the expected results. The RISC-V maintainers might wish to add optimized versions of fmaximum_num and fminimum_num (for float and double), since RISC-V (F extension version 2.2 and later) provides instructions corresponding to those functions - though it might be at least as useful to add architecture-independent built-in functions to GCC and teach the RISC-V back end to expand those functions inline, which is what you generally want for functions that can be implemented with a single instruction. Tested for x86_64 and x86, and with build-many-glibcs.py.
* powerpc: Delete unneeded ELF_MACHINE_BEFORE_RTLD_RELOCFangrui Song2021-09-272-4/+0
| | | | Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* Add narrowing fma functionsJoseph Myers2021-09-223-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the narrowing fused multiply-add functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: ffma, ffmal, dfmal, f32fmaf64, f32fmaf32x, f32xfmaf64 for all configurations; f32fmaf64x, f32fmaf128, f64fmaf64x, f64fmaf128, f32xfmaf64x, f32xfmaf128, f64xfmaf128 for configurations with _Float64x and _Float128; __f32fmaieee128 and __f64fmaieee128 aliases in the powerpc64le case (for calls to ffmal and dfmal when long double is IEEE binary128). Corresponding tgmath.h macro support is also added. The changes are mostly similar to those for the other narrowing functions previously added, especially that for sqrt, so the description of those generally applies to this patch as well. As with sqrt, I reused the same test inputs in auto-libm-test-in as for non-narrowing fma rather than adding extra or separate inputs for narrowing fma. The tests in libm-test-narrow-fma.inc also follow those for non-narrowing fma. The non-narrowing fma has a known bug (bug 6801) that it does not set errno on errors (overflow, underflow, Inf * 0, Inf - Inf). Rather than fixing this or having narrowing fma check for errors when non-narrowing does not (complicating the cases when narrowing fma can otherwise be an alias for a non-narrowing function), this patch does not attempt to check for errors from narrowing fma and set errno; the CHECK_NARROW_FMA macro is still present, but as a placeholder that does nothing, and this missing errno setting is considered to be covered by the existing bug rather than needing a separate open bug. missing-errno annotations are duly added to many of the auto-libm-test-in test inputs for fma. This completes adding all the new functions from TS 18661-1 to glibc, so will be followed by corresponding stdc-predef.h changes to define __STDC_IEC_60559_BFP__ and __STDC_IEC_60559_COMPLEX__, as the support for TS 18661-1 will be at a similar level to that for C standard floating-point facilities up to C11 (pragmas not implemented, but library functions done). (There are still further changes to be done to implement changes to the types of fromfp functions from N2548.) Tested as followed: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* Adjust new narrowing div/mul tests for IBM long double, update powerpc ULPsJoseph Myers2021-09-221-0/+3
| | | | | | Testing for powerpc shows some of the new narrowing div/mul tests need XFAILing for IBM long double and some ULPs updates are needed for those tests.
* powerpc: Fix unrecognized instruction errors with recent GCCPaul A. Clarke2021-09-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | Recent binutils commit b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a changes the behavior of `.machine` directives to override, rather than augment, the base CPU. This can result in _reduced_ functionality when, for example, compiling for default machine "power8", but explicitly asking for ".machine power5", which loses Altivec instructions. In tst-ucontext-ppc64-vscr.c, while the instructions provoking the new error messages are bracketed by ".machine power5", which is ostensibly Power ISA 2.03 (POWER5), the POWER5 processor did not support the VSX subset, so these instructions are not recognized as "power5". Error: unrecognized opcode: `vspltisb' Error: unrecognized opcode: `vpkuwus' Error: unrecognized opcode: `mfvscr' Error: unrecognized opcode: `stvx' Manually adding the VSX subset via ".machine altivec" is sufficient. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* elf: Remove THREAD_GSCOPE_IN_TCBSergey Bugaev2021-09-161-1/+0
| | | | | | | | | All the ports now have THREAD_GSCOPE_IN_TCB set to 1. Remove all support for !THREAD_GSCOPE_IN_TCB, along with the definition itself. Signed-off-by: Sergey Bugaev <bugaevc@gmail.com> Message-Id: <20210915171110.226187-4-bugaevc@gmail.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
* Add narrowing square root functionsJoseph Myers2021-09-103-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the narrowing square root functions from TS 18661-1 / TS 18661-3 / C2X to glibc's libm: fsqrt, fsqrtl, dsqrtl, f32sqrtf64, f32sqrtf32x, f32xsqrtf64 for all configurations; f32sqrtf64x, f32sqrtf128, f64sqrtf64x, f64sqrtf128, f32xsqrtf64x, f32xsqrtf128, f64xsqrtf128 for configurations with _Float64x and _Float128; __f32sqrtieee128 and __f64sqrtieee128 aliases in the powerpc64le case (for calls to fsqrtl and dsqrtl when long double is IEEE binary128). Corresponding tgmath.h macro support is also added. The changes are mostly similar to those for the other narrowing functions previously added, so the description of those generally applies to this patch as well. However, the not-actually-narrowing cases (where the two types involved in the function have the same floating-point format) are aliased to sqrt, sqrtl or sqrtf128 rather than needing a separately built not-actually-narrowing function such as was needed for add / sub / mul / div. Thus, there is no __nldbl_dsqrtl name for ldbl-opt because no such name was needed (whereas the other functions needed such a name since the only other name for that entry point was e.g. f32xaddf64, not reserved by TS 18661-1); the headers are made to arrange for sqrt to be called in that case instead. The DIAG_* calls in sysdeps/ieee754/soft-fp/s_dsqrtl.c are because they were observed to be needed in GCC 7 testing of riscv32-linux-gnu-rv32imac-ilp32. The other sysdeps/ieee754/soft-fp/ files added didn't need such DIAG_* in any configuration I tested with build-many-glibcs.py, but if they do turn out to be needed in more files with some other configuration / GCC version, they can always be added there. I reused the same test inputs in auto-libm-test-in as for non-narrowing sqrt rather than adding extra or separate inputs for narrowing sqrt. The tests in libm-test-narrow-sqrt.inc also follow those for non-narrowing sqrt. Tested as followed: natively with the full glibc testsuite for x86_64 (GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC 11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32 hard float, mips64 (all three ABIs, both hard and soft float). The different GCC versions are to cover the different cases in tgmath.h and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in glibc headers, GCC 7 has proper _Float* support, GCC 8 adds __builtin_tgmath).
* Remove "Contributed by" linesSiddhesh Poyarekar2021-09-0370-74/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | We stopped adding "Contributed by" or similar lines in sources in 2012 in favour of git logs and keeping the Contributors section of the glibc manual up to date. Removing these lines makes the license header a bit more consistent across files and also removes the possibility of error in attribution when license blocks or files are copied across since the contributed-by lines don't actually reflect reality in those cases. Move all "Contributed by" and similar lines (Written by, Test by, etc.) into a new file CONTRIBUTED-BY to retain record of these contributions. These contributors are also mentioned in manual/contrib.texi, so we just maintain this additional record as a courtesy to the earlier developers. The following scripts were used to filter a list of files to edit in place and to clean up the CONTRIBUTED-BY file respectively. These were not added to the glibc sources because they're not expected to be of any use in future given that this is a one time task: https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02 Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Remove sysdeps/*/tls-macros.hFangrui Song2021-08-183-94/+0
| | | | | | | | They provide TLS_GD/TLS_LD/TLS_IE/TLS_IE macros for TLS testing. Now that we have migrated to __thread and tls_model attributes, these macros are unused and the tls-macros.h files can retire. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* elf: Drop elf/tls-macros.h in favor of __thread and tls_model attributes [BZ ↵Fangrui Song2021-08-162-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | #28152] [BZ #28205] elf/tls-macros.h was added for TLS testing when GCC did not support __thread. __thread and tls_model attributes are mature now and have been used by many newer tests. Also delete tst-tls2.c which tests .tls_common (unused by modern GCC and unsupported by Clang/LLD). .tls_common and .tbss definition are almost identical after linking, so the runtime test doesn't add additional coverage. Assembler and linker tests should be on the binutils side. When LLD 13.0.0 is allowed in configure.ac (https://sourceware.org/pipermail/libc-alpha/2021-August/129866.html), `make check` result is on par with glibc built with GNU ld on aarch64 and x86_64. As a future clean-up, TLS_GD/TLS_LD/TLS_IE/TLS_IE macros can be removed from sysdeps/*/tls-macros.h. We can add optional -mtls-dialect={gnu2,trad} tests to ensure coverage. Tested on aarch64-linux-gnu, powerpc64le-linux-gnu, and x86_64-linux-gnu. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* powerpc64: Add checks for Altivec and VSX in ifunc selectionAnton Blanchard2021-08-0626-68/+139
| | | | | | | We'd like to support processors without Altivec or VSX, so check the relevant hwcap bits before selecting them. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64: Check cacheline size before using optimised memset routinesAnton Blanchard2021-08-062-10/+23
| | | | | | | A number of optimised memset routines assume the cacheline size is 128B, so we better check before using them. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64: Replace some PPC_FEATURE_HAS_VSX with PPC_FEATURE_ARCH_2_06Anton Blanchard2021-08-0620-38/+38
| | | | | | | | We use PPC_FEATURE_HAS_VSX to select a number of POWER7 optimised functions. These functions don't use any VSX instructions, so PPC_FEATURE_ARCH_2_06 seems like a better fit. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* Force building with -fno-commonFlorian Weimer2021-07-091-4/+4
| | | | | | | | | | As a result, is not necessary to specify __attribute__ ((nocommon)) on individual definitions. GCC 10 defaults to -fno-common on all architectures except ARC, but this change is compatible with older GCC versions and ARC, too. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* powerpc64le: Fix typo in configureAnton Blanchard2021-07-082-2/+2
| | | | | | The configure script checks for -mlong-double-128 but mentions -mlongdouble when it fails. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc64: Remove strcspn ifunc from the loaderTulio Magno Quites Machado Filho2021-07-081-0/+18
| | | | | | | | | | 5 years ago, commit 8f1b841e452dbb083112fd036033b7f4af506ba0 unintentionally added an ifunc to the loader. That modification has not caused any harm so far, but it doesn't add any value either, because the hwcap information is available later during libc initialization. Suggested-by: Anton Blanchard <anton@ozlabs.org>
* Update powerpc-nofpu libm-test-ulpsJoseph Myers2021-07-071-41/+45
|
* powerpc: optimize strcpy/stpcpy for POWER9/10Pedro Franco de Carvalho2021-07-011-71/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch modifies the current POWER9 implementation of strcpy and stpcpy to optimize it for POWER9/10. Since no new POWER10 instructions are used, the original POWER9 strcpy is modified instead of creating a new implementation for POWER10. This implementation is based on both the original POWER9 implementation of strcpy and the preamble of the new POWER10 implementation of strlen. The changes also affect stpcpy, which uses the same implementation with some additional code before returning. On POWER9, averaging improvements across the benchmark inputs (length/source alignment/destination alignment), for an experiment that ran the benchmark five times, bench-strcpy showed an improvement of 5.23%, and bench-stpcpy showed an improvement of 6.59%. On POWER10, bench-strcpy showed 13.16%, and bench-stpcpy showed 13.59%. The changes are: 1. Removed the null string optimization. Although this results in a few extra cycles for the null string, in combination with the second change, this resulted in improvements for for other cases. 2. Adapted the preamble from strlen for POWER10. This is the part of the function that handles up to the first 16 bytes of the string. 3. Increased number of unrolled iterations in the main loop to 6. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Tested-by: Matheus Castanho <msc@linux.ibm.com>
* Add build option to disable usage of scv on powerpcMatheus Castanho2021-06-101-8/+8
| | | | | | | | | | | | | | | | | Commit 68ab82f56690ada86ac1e0c46bad06ba189a10ef added support for the scv syscall ABI on powerpc. Since then systems that have kernel and processor support started using scv. However adding the proper support for a new syscall ABI requires changes to several other projects (e.g. qemu, valgrind, strace, kernel), which are gradually receiving support. Meanwhile, having a way to disable scv on glibc at build time can be useful for distros that may encounter conflicts with projects that still do not support the scv ABI, buying time until proper support is added. This commit adds a --disable-scv option that disables scv support and uses sc for all syscalls, like before commit 68ab82f56690ada86ac1e0c46bad06ba189a10ef. Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* Remove stale references to libdl.aFlorian Weimer2021-06-092-2/+0
| | | | | | | | Since commit 0c1c3a771eceec46e66ce1183cf988e2303bd373 ("dlfcn: Move dlopen into libc") libdl.a is empty, so linking against it is no longer necessary. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* powerpc: Optimized memcmp for power10Lucas A. M. Magalhaes2021-05-315-1/+218
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch was based on the __memcmp_power8 and the recent __strlen_power10. Improvements from __memcmp_power8: 1. Don't need alignment code. On POWER10 lxvp and lxvl do not generate alignment interrupts, so they are safe for use on caching-inhibited memory. Notice that the comparison on the main loop will wait for both VSR to be ready. Therefore aligning one of the input address does not improve performance. In order to align both registers a vperm is necessary which add too much overhead. 2. Uses new POWER10 instructions This code uses lxvp to decrease contention on load by loading 32 bytes per instruction. The vextractbm is used to have a smaller tail code for calculating the return value. 3. Performance improvement This version has around 35% better performance on average. I saw no performance regressions for any length or alignment. Thanks Matheus for helping me out with some details. Co-authored-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* powerpc: Fix handling of scv return error codes [BZ #27892]Nicholas Piggin2021-05-241-2/+3
| | | | | | | | | | | When using scv for templated ASM syscalls, current code interprets any negative return value as error, but the only valid error codes are in the range -4095..-1 according to the ABI. This commit also fixes 'signal.gen.test' strace test, where the issue was first identified. Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
* Properly check stack alignment [BZ #27901]H.J. Lu2021-05-241-20/+7
| | | | | | | | | | | | | | | | | | | | | | 1. Replace if ((((uintptr_t) &_d) & (__alignof (double) - 1)) != 0) which may be optimized out by compiler, with int __attribute__ ((weak, noclone, noinline)) is_aligned (void *p, int align) { return (((uintptr_t) p) & (align - 1)) != 0; } 2. Add TEST_STACK_ALIGN_INIT to TEST_STACK_ALIGN. 3. Add a common TEST_STACK_ALIGN_INIT to check 16-byte stack alignment for both i386 and x86-64. 4. Update powerpc to use TEST_STACK_ALIGN_INIT. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* powerpc64le: Check HWCAP bits against compiler build flagsFlorian Weimer2021-05-191-0/+52
| | | | | | | When built with GCC 11.1 and -mcpu=power9, ld.so prints this error message when running on POWER8: Fatal glibc error: CPU lacks ISA 3.00 support (POWER9 or later required)
* powerpc: Add optimized rawmemchr for POWER10Matheus Castanho2021-05-176-27/+188
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reuse code for optimized strlen to implement a faster version of rawmemchr. This takes advantage of the same benefits provided by the strlen implementation, but needs some extra steps. __strlen_power10 code should be unchanged after this change. rawmemchr returns a pointer to the char found, while strlen returns only the length, so we have to take that into account when preparing the return value. To quickly check 64B, the loop on __strlen_power10 merges the whole block into 16B by using unsigned minimum vector operations (vminub) and checks if there are any \0 on the resulting vector. The same code is used by rawmemchr if the char c is 0. However, this approach does not work when c != 0. We first need to subtract each byte by c, so that the value we are looking for is converted to a 0, then taking the minimum and checking for nulls works again. The new code branches after it has compared ~256 bytes and chooses which of the two strategies above will be used in the main loop, based on the char c. This extra branch adds some overhead (~5%) for length ~256, but is quickly amortized by the faster loop for larger sizes. Compared to __rawmemchr_power9, this version is ~20% faster for length < 256. Because of the optimized main loop, the improvement becomes ~35% for c != 0 and ~50% for c = 0 for strings longer than 256. Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com>
* powerpc64le: Fix ifunc selection for memset, memmove, bzero and bcopyRaoni Fassina Firmino2021-05-075-20/+22
| | | | | | | | | The hwcap2 check for the aforementioned functions should check for both PPC_FEATURE2_ARCH_3_1 and PPC_FEATURE2_HAS_ISEL but was mistakenly checking for any one of them, enabling isa 3.1 version of the functions in incompatible processors, like POWER8. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* Remove architecture specific sched_cpucount optimizationsAdhemerval Zanella2021-05-071-22/+0
| | | | | | | | | | | | | | And replace the generic algorithm with the Brian Kernighan's one. GCC optimize it with popcnt if the architecture supports, so there is no need to add the extra POPCNT define to enable it. This is really a micro-optimization that only adds complexity: recent ABIs already support it (x86-64-v2 or power64le) and it simplifies the code for internal usage, since i686 does not allow an internal iFUNC call. Checked on x86_64-linux-gnu, aarch64-linux-gnu, and powerpc64le-linux-gnu.
* powerpc64le: Optimize memset for POWER10Raoni Fassina Firmino2021-04-306-1/+314
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implementation is based on __memset_power8 and integrates a lot of suggestions from Anton Blanchard. The biggest difference is that it makes extensive use of stxvl to alignment and tail code to avoid branches and small stores. It has three main execution paths: a) "Short lengths" for lengths up to 64 bytes, avoiding as many branches as possible. b) "General case" for larger lengths, it has an alignment section using stxvl to avoid branches, a 128 bytes loop and then a tail code, again using stxvl with few branches. c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set value being zero. It is mostly the __memset_power8 code but the alignment phase was simplified because, at this point, address is already 16-bytes aligned and also changed to use vector stores. The tail code was also simplified to reuse the general case tail. All unaligned stores use stxvl instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. On average, this implementation provides something around 30% improvement when compared to __memset_power8. Reviewed-by: Matheus Castanho <msc@linux.ibm.com> Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc64le: Optimize memcpy for POWER10Tulio Magno Quites Machado Filho2021-04-305-1/+238
| | | | | | | | | | | | | | | | | | This implementation is based on __memcpy_power8_cached and integrates suggestions from Anton Blanchard. It benefits from loads and stores with length for short lengths and for tail code, simplifying the code. All unaligned memory accesses use instructions that do not generate alignment interrupts on POWER10, making it safe to use on caching-inhibited memory. The main loop has also been modified in order to increase instruction throughput by reducing the dependency on updates from previous iterations. On average, this implementation provides around 30% improvement when compared to __memcpy_power7 and 10% improvement in comparison to __memcpy_power8_cached.
* powerpc64le: Optimized memmove for POWER10Lucas A. M. Magalhaes2021-04-308-7/+388
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch was initially based on the __memmove_power7 with some ideas from strncpy implementation for Power 9. Improvements from __memmove_power7: 1. Use lxvl/stxvl for alignment code. The code for Power 7 uses branches when the input is not naturally aligned to the width of a vector. The new implementation uses lxvl/stxvl instead which reduces pressure on GPRs. It also allows the removal of branch instructions, implicitly removing branch stalls and mispredictions. 2. Use of lxv/stxv and lxvl/stxvl pair is safe to use on Cache Inhibited memory. On Power 10 vector load and stores are safe to use on CI memory for addresses unaligned to 16B. This code takes advantage of this to do unaligned loads. The unaligned loads don't have a significant performance impact by themselves. However doing so decreases register pressure on GPRs and interdependence stalls on load/store pairs. This also improved readability as there are now less code paths for different alignments. Finally this reduces the overall code size. 3. Improved performance. This version runs on average about 30% better than memmove_power7 for lengths larger than 8KB. For input lengths shorter than 8KB the improvement is smaller, it has on average about 17% better performance. This version has a degradation of about 50% for input lengths in the 0 to 31 bytes range when dest is unaligned. Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
* powerpc: Add log IFUNC multiarch support for POWER10Raphael Moreira Zinsly2021-04-267-0/+101
| | | | | | | Checked on ppc64le built without --with-cpu, with --with-cpu=power9 and with --disable-multi-arch. Reviewed-by: Matheus Castanho <msc@linux.ibm.com>
* nptl: Move pthread_spin_trylock into libcFlorian Weimer2021-04-231-1/+9
| | | | The symbol was moved using scripts/move-symbol-to-libc.py.
* nptl: Move pthread_spin_lock into libcFlorian Weimer2021-04-231-1/+7
| | | | The symbol was moved using scripts/move-symbol-to-libc.py.
* nptl: Move pthread_spin_init, Move pthread_spin_unlock into libcFlorian Weimer2021-04-231-1/+9
| | | | | | | For some architectures, the two functions are aliased, so these symbols need to be moved at the same time. The symbols were moved using scripts/move-symbol-to-libc.py.
* powerpc: Add optimized strlen for POWER10Matheus Castanho2021-04-225-1/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improvements compared to POWER9 version: 1. Take into account first 16B comparison for aligned strings The previous version compares the first 16B and increments r4 by the number of bytes until the address is 16B-aligned, then starts doing aligned loads at that address. For aligned strings, this causes the first 16B to be compared twice, because the increment is 0. Here we calculate the next 16B-aligned address differently, which avoids that issue. 2. Use simple comparisons for the first ~192 bytes The main loop is good for big strings, but comparing 16B each time is better for smaller strings. So after aligning the address to 16 Bytes, we check more 176B in 16B chunks. There may be some overlaps with the main loop for unaligned strings, but we avoid using the more aggressive strategy too soon, and also allow the loop to start at a 64B-aligned address. This greatly benefits smaller strings and avoids overlapping checks if the string is already aligned at a 64B boundary. 3. Reduce dependencies between load blocks caused by address calculation on loop Doing a precise time tracing on the code showed many loads in the loop were stalled waiting for updates to r4 from previous code blocks. This implementation avoids that as much as possible by using 2 registers (r4 and r5) to hold addresses to be used by different parts of the code. Also, the previous code aligned the address to 16B, then to 64B by doing a few 48B loops (if needed) until the address was aligned. The main loop could not start until that 48B loop had finished and r4 was updated with the current address. Here we calculate the address used by the loop very early, so it can start sooner. The main loop now uses 2 pointers 128B apart to make pointer updates less frequent, and also unrolls 1 iteration to guarantee there is enough time between iterations to update the pointers, reducing stalled cycles. 4. Use new P10 instructions lxvp is used to load 32B with a single instruction, reducing contention in the load queue. vextractbm allows simplifying the tail code for the loop, replacing vbpermq and avoiding having to generate a permute control vector. Reviewed-by: Paul E Murphy <murphyp@linux.ibm.com> Reviewed-by: Raphael M Zinsly <rzinsly@linux.ibm.com> Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
* nptl: Move __pthread_unwind_next into libcFlorian Weimer2021-04-212-13/+5
| | | | | | | | | | | | | | | | | | | It's necessary to stub out __libc_disable_asynccancel and __libc_enable_asynccancel via rtld-stubbed-symbols because the new direct references to the unwinder result in symbol conflicts when the rtld exception handling from libc is linked in during the construction of librtld.map. unwind-forcedunwind.c is merged into unwind-resume.c. libc now needs the functions that were previously only used in libpthread. The GLIBC_PRIVATE exports of __libc_longjmp and __libc_siglongjmp are no longer needed, so switch them to hidden symbols. The symbol __pthread_unwind_next has been moved using scripts/move-symbol-to-libc.py. Reviewed-by: Adhemerva Zanella <adhemerval.zanella@linaro.org>
* powerpc: Update libm test ulpsTulio Magno Quites Machado Filho2021-04-091-10/+10
| | | | Update after commit 43576de04afc6a0896a3ecc094e1581069a0652a.