| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
This reverts commit a53451559dc9cce765ea5bcbb92c4007e058e92b.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Key Points:
1. On lasx & lsx platforms, We must use _dl_runtime_{profile, resolve}_{lsx, lasx}
to save vector registers.
2. Via "tunables", users can choose str/mem_{lasx,lsx,unaligned} functions with
`export GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,...`.
Note: glibc.cpu.hwcaps doesn't affect _dl_runtime_{profile, resolve}_{lsx, lasx}
selection.
Usage Notes:
1. Only valid inputs: LASX, LSX, UAL. Case-sensitive, comma-separated, no spaces.
2. Example: `export GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,UAL` turns on LASX & UAL.
Unmentioned features turn off. With default ifunc: lasx > lsx > unaligned >
aligned > generic, effect is: lasx > unaligned > aligned > generic; lsx off.
3. Incorrect GLIBC_TUNABLES settings will show error messages.
For example: On lsx platforms, you cannot enable lasx features. If you do
that, you will get error messages.
4. Valid input examples:
- GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX: lasx > aligned > generic.
- GLIBC_TUNABLES=glibc.cpu.hwcaps=LSX,UAL: lsx > unaligned > aligned > generic.
- GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,UAL,LASX,UAL,LSX,LASX,UAL: Repetitions
allowed but not recommended. Results in: lasx > lsx > unaligned > aligned >
generic.
|
|
|
|
|
| |
Change to put magic number to .rodata section in memmove-lsx, and use
pcalau12i and %pc_lo12 with vld to get the data.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According to glibc strrchr microbenchmark test results, this implementation
could reduce the runtime time as following:
Name Percent of rutime reduced
strrchr-lasx 10%-50%
strrchr-lsx 0%-50%
strrchr-aligned 5%-50%
Generic strrchr is implemented by function strlen + memrchr, the lasx version
will compare with generic strrchr implemented by strlen-lasx + memrchr-lasx,
the lsx version will compare with generic strrchr implemented by strlen-lsx +
memrchr-lsx, the aligned version will compare with generic strrchr implemented
by strlen-aligned + memrchr-generic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According to glibc strcpy and stpcpy microbenchmark test results(changed
to use generic_strcpy and generic_stpcpy instead of strlen + memcpy),
comparing with the generic version, this implementation could reduce the
runtime as following:
Name Percent of rutime reduced
strcpy-aligned 8%-45%
strcpy-unaligned 8%-48%, comparing with the aligned version, unaligned
version takes less instructions to copy the tail of data
which length is less than 8. it also has better performance
in case src and dest cannot be both aligned with 8bytes
strcpy-lsx 20%-80%
strcpy-lasx 15%-86%
stpcpy-aligned 6%-43%
stpcpy-unaligned 8%-48%
stpcpy-lsx 10%-80%
stpcpy-lasx 10%-87%
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
According to glibc memcmp microbenchmark test results(Add generic
memcmp), this implementation have performance improvement
except the length is less than 3, details as below:
Name Percent of time reduced
memcmp-lasx 16%-74%
memcmp-lsx 20%-50%
memcmp-aligned 5%-20%
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According to glibc memset microbenchmark test results, for LSX and LASX
versions, A few cases with length less than 8 experience performace
degradation, overall, the LASX version could reduce the runtime about
15% - 75%, LSX version could reduce the runtime about 15%-50%.
The unaligned version uses unaligned memmory access to set data which
length is less than 64 and make address aligned with 8. For this part,
the performace is better than aligned version. Comparing with the generic
version, the performance is close when the length is larger than 128. When
the length is 8-128, the unaligned version could reduce the runtime about
30%-70%, the aligned version could reduce the runtime about 20%-50%.
|
|
|
|
|
|
|
|
|
| |
According to glibc memrchr microbenchmark, this implementation could reduce
the runtime as following:
Name Percent of rutime reduced
memrchr-lasx 20%-83%
memrchr-lsx 20%-64%
|
|
|
|
|
|
|
|
|
|
| |
According to glibc memchr microbenchmark, this implementation could reduce
the runtime as following:
Name Percent of runtime reduced
memchr-lasx 37%-83%
memchr-lsx 30%-66%
memchr-aligned 0%-15%
|
|
|
|
|
|
|
|
|
| |
According to glibc rawmemchr microbenchmark, A few cases tested with
char '\0' experience performance degradation due to the lasx and lsx
versions don't handle the '\0' separately. Overall, rawmemchr-lasx
implementation could reduce the runtime about 40%-80%, rawmemchr-lsx
implementation could reduce the runtime about 40%-66%, rawmemchr-aligned
implementation could reduce the runtime about 20%-40%.
|
|
|
|
|
|
| |
We are requiring Binutils >= 2.41, so la.pcrel always works here.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
|
|
|
|
|
|
|
| |
We are strictly requiring GAS >= 2.41 now, so we don't need to check
assembler capability anymore.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
|
|
|
|
|
|
|
|
| |
Based on the glibc microbenchmark, only a few short inputs with this
strncmp-aligned and strncmp-lsx implementation experience performance
degradation, overall, strncmp-aligned could reduce the runtime 0%-10%
for aligned comparision, 10%-25% for unaligend comparision, strncmp-lsx
could reduce the runtime about 0%-60%.
|
|
|
|
|
|
| |
Based on the glibc microbenchmark, strcmp-aligned implementation could
reduce the runtime 0%-10% for aligned comparison, 10%-20% for unaligned
comparison, strcmp-lsx implemenation could reduce the runtime 0%-50%.
|
|
|
|
|
|
|
| |
Based on the glibc microbenchmark, strnlen-aligned implementation could
reduce the runtime more than 10%, strnlen-lsx implementation could reduce
the runtime about 50%-78%, strnlen-lasx implementation could reduce the
runtime about 50%-88%.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
memmove{aligned, unaligned, lsx, lasx}
These implementations improve the time to copy data in the glibc
microbenchmark as below:
memcpy-lasx reduces the runtime about 8%-76%
memcpy-lsx reduces the runtime about 8%-72%
memcpy-unaligned reduces the runtime of unaligned data copying up to 40%
memcpy-aligned reduece the runtime of unaligned data copying up to 25%
memmove-lasx reduces the runtime about 20%-73%
memmove-lsx reduces the runtime about 50%
memmove-unaligned reduces the runtime of unaligned data moving up to 40%
memmove-aligned reduces the runtime of unaligned data moving up to 25%
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
strchrnul{aligned, lsx, lasx}
These implementations improve the time to run strchr{nul}
microbenchmark in glibc as below:
strchr-lasx reduces the runtime about 50%-83%
strchr-lsx reduces the runtime about 30%-67%
strchr-aligned reduces the runtime about 10%-20%
strchrnul-lasx reduces the runtime about 50%-83%
strchrnul-lsx reduces the runtime about 36%-65%
strchrnul-aligned reduces the runtime about 6%-10%
|
|
|
|
|
|
| |
strlen-lasx is implemeted by LASX simd instructions(256bit)
strlen-lsx is implemeted by LSX simd instructions(128bit)
strlen-align is implemented by LA basic instructions and never use unaligned memory acess
|
|
|
|
|
|
|
| |
LoongArch glibc can add some LASX/LSX vector instructions codes,
change the required minimum binutils version to 2.41 which could
support vector instructions. HAVE_LOONGARCH_VEC_ASM is removed
accordingly.
|
|
|
|
|
|
| |
The following usage of macro LEAF/ENTRY are all feasible:
1. LEAF(fcn) -- the align value of fcn is .align 3(default value)
2. LEAF(fcn, 6) -- the align value of fcn is .align 6
|
|
|
|
|
|
| |
This patch allows the static PIE startfile rcrt1.o to be built
without requiring libgcc_s.so from GCC, which depends on libc
in the first place.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bump autoconf requirement to 2.71 to allow regenerating configure on
more recent distributions. autoconf 2.71 has been in Fedora since F36
and is the current version in Debian stable (bookworm). It appears to
be current in Gentoo as well.
All sysdeps configure and preconfigure scripts have also been
regenerated; all changes are trivial transformations that do not affect
functionality.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch checks if assembler supports vector instructions to
generate LASX/LSX code or not, and then define HAVE_LOONGARCH_VEC_ASM macro
We have added support for vector instructions in binutils-2.41
See:
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=75b2f521b101d974354f6ce9ed7c054d8b2e3b7a
commit 75b2f521b101d974354f6ce9ed7c054d8b2e3b7a
Author: mengqinggang <mengqinggang@loongson.cn>
Date: Thu Jun 22 10:35:28 2023 +0800
LoongArch: gas: Add lsx and lasx instructions support
gas/ChangeLog:
* config/tc-loongarch.c (md_parse_option): Add lsx and lasx option.
(loongarch_after_parse_args): Add lsx and lasx option.
opcodes/ChangeLog:
* loongarch-opc.c (struct loongarch_ase): Add lsx and lasx
instructions.
|
|
|
|
|
| |
It's better to add "\" before "EOF" and remove "\"
before "$".
|
|
|
|
| |
This commit can fix the FAIL item: elf/tst-sprof-basic.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Before GCC r13-2728, it would produce a normal dynamic-linked executable
with -static-pie. I mistakely believed it would produce a static-linked
executable, so failed to detect the breakage. Then with Binutils 2.40
and (vanilla) GCC 12, libc_cv_static_pie_on_loongarch is mistakenly
enabled and cause a building failure with "undefined reference to
_DYNAMIC".
Fix the issue by disabling static PIE if -static-pie creates something
with a INTERP header.
|
|
|
|
|
|
|
|
| |
This patch implements the LoongArch specific math barriers in order to omit
the store and load from stack if possible.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
|
|
|
|
|
| |
Run using vanilla upstream autoconf 2.69.
Minor whitespace change to sysdeps/loongarch/configure and
sysdeps/mach/configure, and nothing else.
|
| |
|
|
|
|
|
| |
This patch is used to fix address out-of-bounds error when building
Chrome.
|
|
|
|
| |
Add inline assembler for the ilogb functions. Passes GLIBC regression.
|
|
|
|
| |
Add inline assembler for the scalb functions. Passes GLIBC regression.
|
|
|
|
|
|
|
|
| |
Add inline assembler for the scalbn functions. Passes GLIBC regression.
GCC 13, LoongArch support ___builtin_scalbn{,f} with -fno-math-errno,
but only "libm" can use -fno-math-errno in GLIBC, and scalbn is in libc
instead of libm because __printf_fp calls it.
|
|
|
|
|
|
|
|
| |
GCC 13 compiles these built-ins instead of generic
implementation for function logb.
Link: https://gcc.gnu.org/r13-3922
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
|
|
|
|
|
|
|
|
| |
GCC 13 compiles these built-ins instead of generic
implementation for function llrint.
Link: https://gcc.gnu.org/r13-3920
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
|
|
|
|
|
|
|
|
| |
GCC 13 compiles these built-ins instead of generic
implementation for function lrint.
Link: https://gcc.gnu.org/r13-3920
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
|
|
|
|
|
|
| |
GCC 13 compiles these built-ins to frint.{d,s} instruction.
Link: https://gcc.gnu.org/r13-3919
|
|
|
|
|
|
|
|
|
|
| |
Use hardware Floating-point instruction f{maxa/mina}.{s/d}, fclass.{s/d}
to implement fmaximum_mag_num{f/ }, fminimum_mag_num{f/ }.
* sysdeps/loongarch/fpu/s_fmaximum_mag_num.c: New file.
* sysdeps/loongarch/fpu/s_fmaximum_mag_numf.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_mag_num.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_mag_numf.c: Likewise.
|
|
|
|
|
|
|
|
|
|
| |
Use hardware Floating-point instruction f{maxa/mina}.{s/d}, fclass.{s/d}
to implement fmaximum_mag{f/ }, fminimum_mag{f/ }.
* sysdeps/loongarch/fpu/s_fmaximum_mag.c: New file.
* sysdeps/loongarch/fpu/s_fmaximum_magf.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_mag.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_magf.c: Likewise.
|
|
|
|
|
|
|
|
|
|
| |
Use hardware Floating-point instruction f{maxa/mina}.{s/d},
to implement fmaxmag{f/ }, fminmag{f/ }.
* sysdeps/loongarch/fpu/s_fmaxmag.c: New file.
* sysdeps/loongarch/fpu/s_fmaxmagf.c: Likewise.
* sysdeps/loongarch/fpu/s_fminmag.c: Likewise.
* sysdeps/loongarch/fpu/s_fminmagf.c: Likewise.
|
|
|
|
|
|
|
|
|
|
| |
Use hardware Floating-point instruction f{max/min}.{s/d}, fclass.{s/d}
to implement fmaximum_num{f/ }, fminimum_num{f/ }.
* sysdeps/loongarch/fpu/s_fmaximum_num.c: New file.
* sysdeps/loongarch/fpu/s_fmaximum_numf.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_num.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum_numf.c: Likewise.
|
|
|
|
|
|
|
|
|
|
| |
Use hardware Floating-point instruction f{max/min}.{s/d}, fclass.{s/d}
to implement fmaximum{f/ }, fminimum{f/ }.
* sysdeps/loongarch/fpu/s_fmaximum.c: New file.
* sysdeps/loongarch/fpu/s_fmaximumf.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimum.c: Likewise.
* sysdeps/loongarch/fpu/s_fminimumf.c: Likewise.
|