path: root/manual/tunables.texi
Commit message  (Author, Date, Files, Lines changed)
* x86: Add separate non-temporal tunable for memset  (Noah Goldstein, 2024-05-30, 1 file, -1/+15)
The tuning for non-temporal stores for memset vs memcpy is not always the same. This includes both the exact value and whether non-temporal stores are profitable at all for a given arch. This patch adds `x86_memset_non_temporal_threshold`. Currently we disable non-temporal stores for non-Intel vendors, as the only benchmarks showing their benefit have been on Intel hardware.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
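For context, a minimal sketch of what a non-temporal memset does, written with SSE2 intrinsics. This is illustrative only, not glibc's implementation; the threshold constant and function name are made up, and the destination is assumed to be 16-byte aligned with a length that is a multiple of 16.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Hypothetical cutoff; glibc derives the real one from cache sizes and
       the x86_memset_non_temporal_threshold tunable.  */
    #define NT_THRESHOLD (1UL << 20)

    static void
    memset_sketch (char *dst, int c, size_t len)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      if (len >= NT_THRESHOLD)
        {
          /* Non-temporal (streaming) stores bypass the cache.  */
          for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128 ((__m128i *) (dst + i), v);
          _mm_sfence ();                /* order the streaming stores */
        }
      else
        for (size_t i = 0; i < len; i += 16)
          _mm_store_si128 ((__m128i *) (dst + i), v);
    }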
* LoongArch: Add glibc.cpu.hwcap support.  (caiyinyu, 2024-04-24, 1 file, -1/+4)
The current IFUNC selection always uses the most recent features that are available via AT_HWCAP, but in some scenarios it is useful to adjust this selection. The environment variable GLIBC_TUNABLES=glibc.cpu.hwcaps=-xxx,yyy,zzz,... can be used to enable HWCAP feature yyy and disable HWCAP feature xxx, where the feature name is case-sensitive and has to match the ones used in sysdeps/loongarch/cpu-tunables.c.
Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
* manual/tunables - Add entry for enable_secure tunable.  (Joe Talbott, 2024-03-01, 1 file, -0/+10)
* elf: Add ELF_DYNAMIC_AFTER_RELOC to rewrite PLT  (H.J. Lu, 2024-01-05, 1 file, -0/+11)
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after relocation. For x86-64, add

  #define DT_X86_64_PLT (DT_LOPROC + 0)
  #define DT_X86_64_PLTSZ (DT_LOPROC + 1)
  #define DT_X86_64_PLTENT (DT_LOPROC + 3)

1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table entry.

With the r_addend field of the R_X86_64_JUMP_SLOT relocation set to the memory offset of the indirect branch instruction.

Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section with direct branch after relocation when lazy binding is disabled. PLT rewrite is disabled by default since SELinux may disallow modifying code pages and ld.so can't detect it in all cases. Use

  $ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1

to enable PLT rewrite with 32-bit direct jump at run-time, or

  $ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2

to enable PLT rewrite with 32-bit direct jump and, on APX processors, with 64-bit absolute jump at run-time.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
* elf: Remove /etc/suid-debug support  (Adhemerval Zanella, 2023-11-21, 1 file, -3/+1)
Since malloc debug support moved to a different library (libc_malloc_debug.so), the glibc.malloc.check tunable requires preloading the debug library to enable it. This means that suid-debug support has not been working since 2.34. Restoring it would require adding additional information and parsing to locate libc_malloc_debug.so. It is one less thing that might change AT_SECURE binaries' behavior due to environment configuration.
Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* AArch64: Remove Falkor memcpy  (Wilco Dijkstra, 2023-11-13, 1 file, -1/+1)
The latest implementations of memcpy are actually faster than the Falkor implementations [1], so remove the falkor/phecda ifuncs for memcpy and the now unused IS_FALKOR/IS_PHECDA defines.
[1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* elf: Add glibc.mem.decorate_maps tunable  (Adhemerval Zanella, 2023-11-07, 1 file, -0/+17)
The PR_SET_VMA_ANON_NAME support is only enabled through a configurable kernel switch, mainly because assigning a name to an anonymous virtual memory area might prevent that area from being merged with adjacent virtual memory areas. For instance, with the following code:

  void *p1 = mmap (NULL, 1024 * 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  void *p2 = mmap (p1 + (1024 * 4096), 1024 * 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

the kernel will potentially merge both mappings, resulting in only one segment of size 0x800000. If the segments are named with PR_SET_VMA_ANON_NAME using different names, this results in two mappings.

Although this is unlikely to be an issue for pthread stacks and malloc arenas (since for pthread stacks the guard page will result in a PROT_NONE segment, similar to the alignment requirement for the arena block), it still might prevent merging of the mmap memory allocated by malloc. There is also another potential scalability issue: the prctl requires taking the mmap global lock, which is still not fully fixed in Linux [1] (for pthread stacks and arenas, it is mitigated by the stack cache and arena reuse).

So this patch disables anonymous mapping annotations by default and adds a new tunable, glibc.mem.decorate_maps, which can be used to enable them.

[1] https://lwn.net/Articles/906852/

Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
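The mechanism the tunable turns on is the Linux prctl shown below. This is a simplified, hedged sketch of naming an anonymous mapping (not glibc's code); it assumes a kernel built with CONFIG_ANON_VMA_NAME and kernel headers that define PR_SET_VMA and PR_SET_VMA_ANON_NAME, and the helper name is made up.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>      /* PR_SET_VMA, PR_SET_VMA_ANON_NAME */
    #include <stddef.h>

    /* Create an anonymous mapping and give it a name that shows up in
       /proc/self/maps.  Errors from prctl are ignored here: older kernels
       (or kernels without CONFIG_ANON_VMA_NAME) return EINVAL.  */
    static void *
    named_anon_map (size_t len, const char *name)
    {
      void *p = mmap (NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p != MAP_FAILED)
        prctl (PR_SET_VMA, PR_SET_VMA_ANON_NAME,
               (unsigned long) p, len, (unsigned long) name);
      return p;
    }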
* PowerPC: Influence cpu/arch hwcap features via GLIBC_TUNABLES  (Mahesh Bodapati, 2023-08-01, 1 file, -1/+4)
This patch enables the option to influence hwcaps used by PowerPC. The environment variable GLIBC_TUNABLES=glibc.cpu.hwcaps=-xxx,yyy,-zzz,... can be used to enable CPU/ARCH feature yyy and disable CPU/ARCH features xxx and zzz, where the feature name is case-sensitive and has to match the ones mentioned in the file sysdeps/powerpc/dl-procinfo.c. Note that the hwcap tunables are only used in IFUNC selection.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* Fix misspellings in manual/ -- BZ 25337  (Paul Pluzhnikov, 2023-05-27, 1 file, -2/+2)
* Created tunable to force small pages on stack allocation.  (Cupertino Miranda, 2023-04-20, 1 file, -0/+15)
Created tunable glibc.pthread.stack_hugetlb to control when hugepages can be used for stack allocation. In case THP are enabled and glibc.pthread.stack_hugetlb is set to 0, glibc will madvise the kernel not to use hugepages for stack allocations.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
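The madvise call described here can be sketched as follows; this is an illustrative fragment (the helper name is hypothetical, and glibc performs the equivalent inside its own stack allocation code), not the patch itself.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Reserve a thread stack and ask the kernel not to back it with
       transparent huge pages.  */
    static void *
    alloc_stack_no_thp (size_t size)
    {
      void *stack = mmap (NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
      if (stack != MAP_FAILED)
        madvise (stack, size, MADV_NOHUGEPAGE);   /* advisory only */
      return stack;
    }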
* tunables.texi: Change \code{1} to @code{1}  (H.J. Lu, 2023-02-23, 1 file, -1/+1)
Update 317f1c0a8a ("x86-64: Add glibc.cpu.prefer_map_32bit_exec [BZ #28656]").
* x86-64: Add glibc.cpu.prefer_map_32bit_exec [BZ #28656]  (H.J. Lu, 2023-02-22, 1 file, -9/+24)
Crossing 2GB boundaries with indirect calls and jumps can use more branch prediction resources on Intel Golden Cove CPUs (see the "Misprediction for Branches >2GB" section in the Intel 64 and IA-32 Architectures Optimization Reference Manual). There is a visible performance improvement on workloads with many PLT calls when executables and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC bit so that mmap will try to map executable or denywrite pages in shared libraries with MAP_32BIT first.

NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space layout randomization (ASLR). It is always disabled for SUID programs and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec, or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only between shared libraries or between shared libraries and executables with addresses below 2GB. PIEs are usually loaded at a random address above 4GB by the kernel.
* gmon: improve mcount overflow handling [BZ #27576]  (Simon Kissane, 2023-02-22, 1 file, -0/+59)
When mcount overflows, no gmon.out file is generated, but no message is printed to the user, leaving the user with no idea why, and thinking maybe there is some bug - which is how BZ 27576 ended up being logged. Print a message to stderr in this case so the user knows what is going on.

As a comment in sys/gmon.h acknowledges, the hardcoded MAXARCS value is too small for some large applications, including the test case in that BZ. Rather than increase it, add tunables to enable MINARCS and MAXARCS to be overridden at runtime (glibc.gmon.minarcs and glibc.gmon.maxarcs). So if a user gets the mcount overflow error, they can try increasing maxarcs (they might need to increase minarcs too if the heuristic is wrong in their case).

Note that setting minarcs/maxarcs too large can cause monstartup to fail with an out of memory error. If you set them large enough, it can cause an integer overflow in calculating the buffer size. I haven't done anything to defend against that - it would not generally be a security vulnerability, since these tunables will be ignored in suid/sgid programs (due to the SXID_ERASE default), and if you can set GLIBC_TUNABLES in the environment of a process, you can take it over anyway (LD_PRELOAD, LD_LIBRARY_PATH, etc). I thought about modifying the code of monstartup to defend against integer overflows, but doing so is complicated, and I realise the existing code is susceptible to them even prior to this change (e.g. try passing a pathologically large highpc argument to monstartup), so I decided just to leave that possibility in place.

Add a test case which demonstrates mcount overflow and the tunables. Document the new tunables in the manual.

Signed-off-by: Simon Kissane <skissane@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
* S390: Influence hwcaps/stfle via GLIBC_TUNABLES.  (Stefan Liebler, 2023-02-07, 1 file, -1/+5)
This patch enables the option to influence hwcaps and stfle bits used by the s390-specific ifunc-resolvers. The currently x86-specific tunable glibc.cpu.hwcaps is also used on s390x to achieve this. In addition the user can also set a CPU arch-level like z13 instead of single HWCAP and STFLE features.

Note that the tunable only handles the features which are really used in the IFUNC-resolvers. All others are ignored, as the values are only used inside glibc. Thus we can influence:
- HWCAP_S390_VXRS (z13)
- HWCAP_S390_VXRS_EXT (z14)
- HWCAP_S390_VXRS_EXT2 (z15)
- STFLE_MIE3 (z15)

The influenced hwcap/stfle bits are stored in the s390-specific cpu_features struct, which also contains reserved fields for future usage. The ifunc-resolvers and users of stfle bits are adjusted to use the information from the cpu_features struct.

On 31-bit, the ELF_MACHINE_IRELATIVE macro is now also defined. Otherwise the new ifunc-resolvers segfault as they depend on the not yet processed _rtld_global_ro@GLIBC_PRIVATE relocation.
* malloc: Correct the documentation of the top_pad default  (Florian Weimer, 2022-08-04, 1 file, -1/+1)
DEFAULT_TOP_PAD is defined as 131072 in sysdeps/generic/malloc-machine.h.
* AArch64: Add asymmetric faulting mode for tag violations in mem.tagging tunable  (Tejas Belagod, 2022-06-30, 1 file, -0/+3)
The new asymmetric mode is available when HWCAP2_MTE3 is set (support is available), bit 2 is set in the tunable (user request per application), and the system is configured such that the asymmetric mode is preferred over sync or async (per-cpu system-wide setting).
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* x86: Add bounds `x86_non_temporal_threshold`  (Noah Goldstein, 2022-06-15, 1 file, -1/+1)
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed by memmove-vec-unaligned-erms. The lower-bound is needed because memmove-vec-unaligned-erms unrolls the loop aggressively in the L(large_memset_4x) case. The upper-bound is needed because memmove-vec-unaligned-erms right-shifts the value of `x86_non_temporal_threshold` by LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow. The lack of lower-bound can be a correctness issue. The lack of upper-bound cannot.
* manual: fix reference to source file  (Andreas Schwab, 2022-05-31, 1 file, -1/+1)
* malloc: Add Huge Page support for mmap  (Adhemerval Zanella, 2021-12-15, 1 file, -0/+7)
With the morecore hook removed, there is no easy way to provide huge page support with the glibc allocator without resorting to transparent huge pages, and some users and programs do prefer to use huge pages directly instead of THP for multiple reasons: no splitting, no re-merging by the VM, no TLB shootdowns for running processes, fast allocation from the reserve pool, no competition with the rest of the processes unlike THP, no swapping at all, etc.

This patch extends the 'glibc.malloc.hugetlb' tunable: the value '2' means to use huge pages directly with the system default size, while a larger value means a specific page size that is matched against the ones supported by the system. Currently only memory allocated in sysmalloc() is handled; the arenas still use the default system page size.

To test it, a new rule tests-malloc-hugetlb2 is added, which runs the added tests with the required GLIBC_TUNABLES setting. On systems without a reserved huge page pool, this just stresses the mmap(MAP_HUGETLB) allocation failure path. To improve test coverage it is required to create a pool with some allocated pages.

Checked on x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
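The mmap flag involved can be illustrated with a short, hedged sketch; the helper name is made up, and real code (glibc included) falls back to regular pages when the hugetlb reserve pool cannot satisfy the request.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Allocate LEN bytes backed by the hugetlb reserve pool, using the
       system default huge page size.  A specific size could be requested by
       OR-ing in (log2(pagesize) << MAP_HUGE_SHIFT), e.g. MAP_HUGE_2MB, where
       the kernel headers provide those macros.  */
    static void *
    alloc_hugetlb (size_t len)
    {
      return mmap (NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    }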
* malloc: Add madvise support for Transparent Huge Pages  (Adhemerval Zanella, 2021-12-15, 1 file, -0/+10)
Linux Transparent Huge Pages (THP) currently supports three different states: 'never', 'madvise', and 'always'. The 'never' state is self-explanatory and 'always' enables THP for all anonymous pages. However, 'madvise' is still the default on some systems, and in that case THP will only be used if the memory range is explicitly advised by the program through a madvise(MADV_HUGEPAGE) call.

To enable it a new tunable is provided, 'glibc.malloc.hugetlb', where setting it to a value different from 0 enables the madvise call. This patch issues the madvise(MADV_HUGEPAGE) call after a successful mmap() call at sysmalloc() with sizes larger than the default huge page size. The madvise() call is disabled if the system does not support THP or if it has the mode set to "never"; Linux only supports one page size for THP, even if the architecture supports multiple sizes.

To test it, a new rule tests-malloc-hugetlb1 is added, which runs the added tests with the required GLIBC_TUNABLES setting.

Checked on x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
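The madvise step described here looks roughly like the following; a simplified sketch with a hypothetical helper name, not the sysmalloc code itself.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Map LEN bytes and hint that the kernel may back the range with
       transparent huge pages.  The hint is advisory; a failure of madvise
       is harmless and is ignored here.  */
    static void *
    mmap_with_thp_hint (size_t len)
    {
      void *p = mmap (NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p != MAP_FAILED)
        madvise (p, len, MADV_HUGEPAGE);
      return p;
    }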
* elf: Use new dependency sorting algorithm by default  (Florian Weimer, 2021-12-14, 1 file, -1/+1)
The default has to change eventually, and there are no known failures that require a delay.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* nptl: Add glibc.pthread.rseq tunable to control rseq registration  (Florian Weimer, 2021-12-09, 1 file, -0/+10)
This tunable allows applications to register the rseq area instead of glibc.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
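When the tunable disables glibc's registration, the application performs the rseq system call itself. Below is a minimal, Linux-specific sketch; it rests on several assumptions (per-thread registration, the signature value commonly used as RSEQ_SIG, no error handling) and is not glibc or kernel documentation.

    #include <linux/rseq.h>        /* struct rseq (kernel >= 4.18 headers) */
    #include <sys/syscall.h>
    #include <unistd.h>

    /* The rseq area must stay registered for the lifetime of the thread.  */
    static __thread struct rseq rseq_area __attribute__ ((aligned (32)));

    static int
    register_rseq (void)
    {
      /* 0x53053053 is the signature conventionally used as RSEQ_SIG.  */
      return syscall (SYS_rseq, &rseq_area, sizeof rseq_area,
                      0 /* flags */, 0x53053053);
    }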
* elf: Fix slow DSO sorting behavior in dynamic loader (BZ #17645)  (Chung-Lin Tang, 2021-10-21, 1 file, -0/+11)
This second patch contains the actual implementation of a new sorting algorithm for shared objects in the dynamic loader, which solves the slow behavior that the current "old" algorithm falls into when the DSO set contains circular dependencies. The new algorithm implemented here is simply depth-first search (DFS) to obtain the Reverse-Post Order (RPO) sequence, a topological sort. A new l_visited:1 bitfield is added to struct link_map to more elegantly facilitate such a search.

The DFS algorithm is applied to the input maps[nmap-1] backwards towards maps[0]. This has the effect of a more "shallow" recursion depth in general since the input is in BFS. Also, when combined with the natural order of processing l_initfini[] at each node, this creates a resulting output sorting closer to the intuitive "left-to-right" order in most cases.

Another notable implementation adjustment related to this _dl_sort_maps change is the removal of the two char arrays 'used' and 'done' in _dl_close_worker that represent two per-map attributes. This has been changed to simply use two new bit-fields, l_map_used:1 and l_map_done:1, added to struct link_map. This also allows discarding the clunky 'used' array sorting that _dl_sort_maps had to sometimes do along the way.

Tunable support for switching between different sorting algorithms at runtime is also added. A new tunable 'glibc.rtld.dynamic_sort' with current valid values 1 (old algorithm) and 2 (new DFS algorithm) has been added. At time of commit of this patch, the default setting is 1 (old algorithm).

Signed-off-by: Chung-Lin Tang <cltang@codesourcery.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
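The core idea (depth-first search emitting nodes in post-order, then reading the result in reverse to get a topological order) can be sketched generically as follows. The structure and field names here are invented for illustration; glibc's _dl_sort_maps operates on struct link_map and handles cycles and constraints that this sketch ignores.

    #include <stddef.h>

    struct node
    {
      struct node **deps;       /* direct dependencies */
      size_t ndeps;
      int visited;
    };

    /* Emit a node only after all of its dependencies have been emitted
       (post-order).  Reading OUT backwards afterwards yields the RPO
       sequence, i.e. a topological sort of the dependency graph.  */
    static void
    dfs_postorder (struct node *n, struct node **out, size_t *pos)
    {
      n->visited = 1;
      for (size_t i = 0; i < n->ndeps; i++)
        if (!n->deps[i]->visited)
          dfs_postorder (n->deps[i], out, pos);
      out[(*pos)++] = n;
    }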
* manual: Drop the .so suffix in libc_malloc_debug description  (Siddhesh Poyarekar, 2021-07-27, 1 file, -1/+1)
All references to libraries in the manual are without the .so suffix, so do the same for libc_malloc_debug.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Move malloc hooks into a compat DSO  (Siddhesh Poyarekar, 2021-07-22, 1 file, -1/+3)
Remove all malloc hook uses from core malloc functions and move them into a new library, libc_malloc_debug.so. With this, the hooks no longer have any effect on the core library.

libc_malloc_debug.so is a malloc interposer that needs to be preloaded to get hooks functionality back so that the debugging features that depend on the hooks, i.e. malloc-check, mcheck and mtrace, work again. Without the preloaded DSO these debugging features will be nops. These features will be ported away from hooks in subsequent patches. Similarly, legacy applications that need hooks functionality need to preload libc_malloc_debug.so.

The symbols exported by libc_malloc_debug.so are maintained at exactly the same version as libc.so.

Finally, static binaries will no longer be able to use malloc debugging features since they cannot preload the debugging DSO.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
* glibc.malloc.check: Fix nit in documentation  (Siddhesh Poyarekar, 2021-07-07, 1 file, -1/+1)
The tunable will not work with *any* non-zero tunable value, since its list of allowed values is 0-3. Fix the documentation to reflect that.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* nptl: Add glibc.pthread.stack_cache_size tunable  (Florian Weimer, 2021-06-28, 1 file, -0/+9)
The valgrind/helgrind test suite needs a way to make stack deallocation more prompt, and this feature seems to be generally useful.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
* aarch64: Added optimized memcpy and memmove for A64FX  (Naohiro Tamura, 2021-05-27, 1 file, -1/+2)
This patch optimizes the performance of memcpy/memmove for A64FX [1], which implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node.

The performance optimization makes use of the Scalable Vector Registers with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining.

The SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code, so theoretically it can run on any SoC which supports the ARMv8-A SVE standard.

We confirmed that all testcases pass by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. We also confirmed that the SVE 512-bit vector register performance is roughly 4 times better than Advanced SIMD 128-bit registers and 8 times better than scalar 64-bit registers by running 'make bench'.

[1] https://github.com/fujitsu/A64FX
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
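"Vector Length Agnostic" means the same code adapts to whatever SVE vector length the hardware provides; a minimal C-intrinsics sketch of a VLA copy loop is below. This is illustrative ACLE-style code (the function name is made up), not the hand-written assembler the commit adds.

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy N bytes; the predicate handles the tail, and svcntb() is the
       number of bytes per SVE vector on the running machine.  */
    static void
    sve_copy_sketch (uint8_t *dst, const uint8_t *src, size_t n)
    {
      for (size_t i = 0; i < n; i += svcntb ())
        {
          svbool_t pg = svwhilelt_b8_u64 (i, n);
          svst1_u8 (pg, dst + i, svld1_u8 (pg, src + i));
        }
    }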
* Improve documentation for malloc etc. (BZ #27719)  (Paul Eggert, 2021-04-13, 1 file, -6/+8)
Cover key corner cases (e.g., whether errno is set) that are well settled in glibc, fix some examples to avoid integer overflow, and update some other dated examples (code needed for K&R C, e.g.).

* manual/charset.texi (Non-reentrant String Conversion):
* manual/filesys.texi (Symbolic Links):
* manual/memory.texi (Allocating Cleared Space):
* manual/socket.texi (Host Names):
* manual/string.texi (Concatenating Strings):
* manual/users.texi (Setting Groups): Use reallocarray instead of realloc, to avoid integer overflow issues.
* manual/filesys.texi (Scanning Directory Content):
* manual/memory.texi (The GNU Allocator, Hooks for Malloc):
* manual/tunables.texi: Use code font for 'malloc' instead of roman font.
(Symbolic Links): Don't assume readlink return value fits in 'int'.
* manual/memory.texi (Memory Allocation and C, Basic Allocation) (Malloc Examples, Alloca Example):
* manual/stdio.texi (Formatted Output Functions):
* manual/string.texi (Concatenating Strings, Collation Functions): Omit pointer casts that are needed only in ancient K&R C.
* manual/memory.texi (Basic Allocation): Say that malloc sets errno on failure. Say "convert" rather than "cast", since casts are no longer needed.
* manual/memory.texi (Basic Allocation):
* manual/string.texi (Concatenating Strings): In examples, use C99 declarations after statements for brevity.
* manual/memory.texi (Malloc Examples): Add portability notes for malloc (0), errno setting, and PTRDIFF_MAX.
(Changing Block Size): Say that realloc (p, 0) acts like (p ? (free (p), NULL) : malloc (0)). Add xreallocarray example, since other examples can use it. Add portability notes for realloc (0, 0), realloc (p, 0), PTRDIFF_MAX, and improve notes for reallocating to the same size.
(Allocating Cleared Space): Reword now-confusing discussion about replacement, and xref "Replacing malloc".
* manual/stdio.texi (Formatted Output Functions): Don't assume message size fits in 'int'.
* manual/string.texi (Concatenating Strings): Fix undefined behavior involving arithmetic on a freed pointer.
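Since the commit mentions adding an xreallocarray example to the manual, here is a hedged sketch of what such a helper typically looks like, built on glibc's reallocarray; the exit-on-failure policy and exact form are assumptions here, not the manual's text.

    #define _GNU_SOURCE           /* for reallocarray */
    #include <stdio.h>
    #include <stdlib.h>

    /* Like reallocarray, but terminate the process on overflow or
       allocation failure so that callers need not check the result.  */
    void *
    xreallocarray (void *ptr, size_t nmemb, size_t size)
    {
      void *result = reallocarray (ptr, nmemb, size);
      if (result == NULL && nmemb != 0 && size != 0)
        {
          fprintf (stderr, "out of memory\n");
          exit (EXIT_FAILURE);
        }
      return result;
    }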
* ld.so: Add --list-tunables to print tunable values  (H.J. Lu, 2021-01-15, 1 file, -0/+38)
Pass --list-tunables to ld.so to print tunables with min and max values.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
* elf: Add a tunable to control use of tagged memory  (Richard Earnshaw, 2020-12-21, 1 file, -0/+35)
Add a new glibc tunable: mem.tagging. This is a decimal constant in the range 0-255 but used as a bit-field. Bit 0 enables use of tagged memory in the malloc family of functions. Bit 1 enables precise faulting of tag failure on platforms where this can be controlled. Other bits are currently unused, but if set will cause memory tag checking for the current process to be enabled in the kernel.
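To make the bit-field encoding concrete, a small illustrative decoding follows; the macro names are invented here, not glibc identifiers (bit 2 was later assigned to the asymmetric MTE faulting mode, as the 2022-06-30 entry above notes).

    /* Hypothetical decoding of the glibc.mem.tagging value (0-255).  */
    #define MEM_TAG_ENABLE   (1u << 0)   /* tag memory in malloc and friends */
    #define MEM_TAG_PRECISE  (1u << 1)   /* precise (synchronous) tag faults */
    #define MEM_TAG_ASYMM    (1u << 2)   /* asymmetric faulting, where supported */

    static int
    tagging_enabled (unsigned int tunable_value)
    {
      return (tunable_value & MEM_TAG_ENABLE) != 0;
    }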
* Reversing calculation of __x86_shared_non_temporal_threshold  (Patrick McGehearty, 2020-09-28, 1 file, -1/+5)
The __x86_shared_non_temporal_threshold determines when memcpy on x86 uses non-temporal stores to avoid pushing other data out of the last level cache. This patch proposes to revert the calculation change made by H.J. Lu's patch of June 2, 2017.

H.J. Lu's patch selected a threshold suitable for a single thread getting maximum performance. It was tuned using the single threaded large memcpy micro benchmark on an 8 core processor. That change moved the threshold from using 3/4 of one thread's share of the cache to using 3/4 of the entire cache of a multi-threaded system before switching to non-temporal stores. Multi-threaded systems with more than a few threads are server-class and typically have many active threads. If one thread consumes 3/4 of the available cache for all threads, it will cause other active threads to have data removed from the cache.

Two examples show the range of the effect. John McCalpin's widely parallel Stream benchmark, which runs in parallel and fetches data sequentially, saw a 20% slowdown with this patch on an internal system test of 128 threads. This regression was discovered when comparing OL8 performance to OL7. An example that compares normal stores to non-temporal stores may be found at https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test shows a performance loss of 400 to 500% due to a failure to use non-temporal stores. These performance losses are most likely to occur when the system load is heaviest and good performance is critical.

The tunable x86_non_temporal_threshold can be used to override the default for the knowledgeable user who really wants maximum cache allocation to a single thread in a multi-threaded system. The manual entry for the tunable has been expanded to provide more information about its purpose.

modified: sysdeps/x86/cacheinfo.c
modified: manual/tunables.texi
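For illustration, the per-thread calculation the patch returns to can be sketched as follows; the cache size and thread count are assumed example values (a 32 MB shared cache and 16 threads), not measurements from the commit.

    #include <stdio.h>

    int
    main (void)
    {
      /* Assumed example values, not from the commit.  */
      unsigned long shared_cache_bytes = 32UL * 1024 * 1024;  /* last-level cache */
      unsigned long threads_sharing = 16;

      /* One thread's share of the cache, and 3/4 of it as the point where
         memcpy switches to non-temporal stores.  */
      unsigned long per_thread = shared_cache_bytes / threads_sharing;
      unsigned long threshold = per_thread * 3 / 4;

      printf ("non-temporal threshold ~ %lu bytes\n", threshold);  /* 1572864 */
      return 0;
    }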
* rtld: Avoid using up static TLS surplus for optimizations [BZ #25051]  (Szabolcs Nagy, 2020-07-08, 1 file, -0/+17)
On some targets the static TLS surplus area can be used opportunistically for dynamically loaded modules such that the TLS access then becomes faster (TLSDESC and powerpc TLS optimization). However we don't want all surplus TLS to be used for this optimization because dynamically loaded modules with initial-exec model TLS can only use surplus TLS.

The new contract for surplus static TLS use is:
- libc.so can have up to 192 bytes of IE TLS,
- other system libraries together can have up to 144 bytes of IE TLS,
- some "optional" static TLS is available for opportunistic use.

The optional TLS is now tunable: rtld.optional_static_tls, so users can directly affect the allocated static TLS size. (Note that module unloading with dlclose does not reclaim static TLS. After the optional TLS runs out, TLS access is no longer optimized to use static TLS.)

The default setting of rtld.optional_static_tls is 512, so the surplus TLS is 3*192 + 4*144 + 512 = 1664 by default, the same as before.

Fixes BZ #25051.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* rtld: Account static TLS surplus for audit modules  (Szabolcs Nagy, 2020-07-08, 1 file, -3/+6)
The new static TLS surplus size computation is

  surplus_tls = 192 * (nns-1) + 144 * nns + 512

where nns is controlled via the rtld.nns tunable. This commit accounts for audit modules too, so nns = rtld.nns + audit modules.

rtld.nns should only include the namespaces required by the application; namespaces for audit modules are accounted on top of that, so audit modules don't use up the static TLS that is reserved for the application. This allows loading many audit modules without tuning rtld.nns or using up static TLS, and it fixes

  FAIL: elf/tst-auditmany

Note that DL_NNS is currently a hard upper limit for nns, and if rtld.nns + audit modules go over the limit that's a fatal error. By default rtld.nns is 4, which allows 12 audit modules.

Counting the audit modules is based on existing audit string parsing code; we cannot use GLRO(dl_naudit) before the modules are actually loaded.
* rtld: Add rtld.nns tunable for the number of supported namespaces  (Szabolcs Nagy, 2020-07-08, 1 file, -0/+21)
TLS_STATIC_SURPLUS is currently 1664 bytes, which is not enough to support DL_NNS (== 16) dynamic link namespaces, if we assume 192 bytes of TLS are reserved for libc use and 144 bytes are reserved for other system libraries that use IE TLS.

A new tunable is introduced to control the number of supported namespaces and to adjust the surplus static TLS size as follows:

  surplus_tls = 192 * (rtld.nns-1) + 144 * rtld.nns + 512

The default is rtld.nns == 4, and then the surplus TLS size is the same as before, so the behaviour is unchanged by default. If an application creates more namespaces than the rtld.nns setting allows, then it is not guaranteed to work, but the limit is not checked. So existing usage will continue to work, but in the future, if an application creates more than 4 dynamic link namespaces, the tunable will need to be set.

In this patch DL_NNS is a fixed value and provides a maximum for the rtld.nns setting.

Static linking used a fixed 2048 bytes of surplus TLS; this is changed so the same contract is used as for dynamic linking. With static linking DL_NNS == 1, so the rtld.nns tunable is forced to 1, and by default the surplus TLS is reduced to 144 + 512 = 656 bytes. This change is not expected to cause problems.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
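A quick check of the surplus formula used in these three rtld entries, with the default rtld.nns of 4; this snippet is purely illustrative.

    #include <stdio.h>

    int
    main (void)
    {
      unsigned long nns = 4;                        /* default rtld.nns */
      unsigned long surplus_tls = 192 * (nns - 1)   /* libc IE TLS for extra namespaces */
                                  + 144 * nns       /* other system libraries' IE TLS */
                                  + 512;            /* optional, opportunistic TLS */
      printf ("surplus static TLS = %lu bytes\n", surplus_tls);  /* prints 1664 */
      return 0;
    }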
* x86: Add thresholds for "rep movsb/stosb" to tunables  (H.J. Lu, 2020-07-06, 1 file, -0/+16)
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables to update the thresholds for "rep movsb" and "rep stosb" at run-time. Note that a user-specified threshold for "rep movsb" smaller than the minimum threshold will be ignored.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* aarch64: Add Huawei Kunpeng to tunable cpu list  (Xuelei Zhang, 2019-12-19, 1 file, -1/+1)
The Kunpeng processor is a 64-bit Arm-compatible CPU released by Huawei, and we have already signed a copyright assignment with the FSF. This patch adds it to the cpu list, along with the related macro for IFUNC.
Checked on aarch64-linux-gnu.
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
* Add glibc.malloc.mxfast tunable  (DJ Delorie, 2019-08-09, 1 file, -0/+12)
* elf/dl-tunables.list: Add glibc.malloc.mxfast.
* manual/tunables.texi: Document it.
* malloc/malloc.c (do_set_mxfast): New.
(__libc_mallopt): Call it.
* malloc/arena.c: Add mxfast tunable.
* malloc/tst-mxfast.c: New.
* malloc/Makefile: Add it.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* Small tcache improvements  (Wilco Dijkstra, 2019-05-17, 1 file, -1/+1)
Change the tcache->counts[] entries to uint16_t - this removes the limit set by char and allows a larger tcache. Remove a few redundant asserts. bench-malloc-thread with 4 threads is ~15% faster on Cortex-A72.
Reviewed-by: DJ Delorie <dj@redhat.com>
* malloc/malloc.c (MAX_TCACHE_COUNT): Increase to UINT16_MAX.
(tcache_put): Remove redundant assert.
(tcache_get): Remove redundant asserts.
(__libc_malloc): Check tcache count is not zero.
* manual/tunables.texi (glibc.malloc.tcache_count): Update maximum.
* Fix tcache count maximum (BZ #24531)  (Wilco Dijkstra, 2019-05-10, 1 file, -2/+2)
The tcache counts[] array is a char, which has a very small range and thus may overflow. When setting the tcache_count tunable, there is no overflow check. However the tunable must not be larger than the maximum value of the tcache counts[] array, otherwise it can overflow when filling the tcache.
[BZ #24531]
* malloc/malloc.c (MAX_TCACHE_COUNT): New define.
(do_set_tcache_count): Only update if count is small enough.
* manual/tunables.texi (glibc.malloc.tcache_count): Document max value.
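A tiny standalone illustration of the wrap-around these two tcache entries guard against, assuming a narrow 8-bit counter (the variable names are not glibc's):

    #include <stdio.h>

    int
    main (void)
    {
      unsigned char narrow = 255;   /* a char-sized per-bin counter */
      unsigned short wide = 255;    /* a uint16_t-sized counter */

      narrow++;                     /* wraps around to 0 */
      wide++;                       /* 256; no wrap until 65535 (UINT16_MAX) */

      printf ("narrow: %u, wide: %u\n", narrow, wide);
      return 0;
    }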
* aarch64: Add AmpereComputing emag to tunable cpu list  (Feng Xue, 2019-02-01, 1 file, -1/+1)
Emag is a 64-bit CPU core released by AmpereComputing. Add its name to the cpu list, and a corresponding macro as a utility for later IFUNC dispatch.
* manual/tunables.texi (Tunable glibc.cpu.name): Add emag.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.c (cpu_list): Add emag.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.h (IS_EMAG): New macro.
* [AArch64] Add ifunc support for Ares  (Wilco Dijkstra, 2019-01-09, 1 file, -1/+1)
Add Ares to the midr_el0 list and support ifunc dispatch. Since Ares supports 2 128-bit loads/stores, use Neon registers for memcpy by selecting __memcpy_falkor by default (we should rename this to __memcpy_simd or similar).
* manual/tunables.texi (glibc.cpu.name): Add ares tunable.
* sysdeps/aarch64/multiarch/memcpy.c (__libc_memcpy): Use __memcpy_falkor for ares.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.h (IS_ARES): Add new define.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.c (cpu_list): Add ares cpu.
* Mutex: Add pthread mutex tunables  (Kemi Wang, 2018-12-01, 1 file, -0/+27)
This patch does not have any functionality change; it only provides a spin count tunable for the pthread adaptive spin mutex. The tunable glibc.pthread.mutex_spin_count can be used by a system administrator to squeeze more system performance out of different hardware capabilities and workload characteristics.

The maximum value of the spin count is limited to 32767 to avoid overflow of the mutex->__data.__spins variable, whose type may be short, in pthread_mutex_lock(). The default value of the spin count is set to 100 with reference to the previous number of times of spinning via trylock. This value would be architecture-specific and can be tuned with various benchmarks to fit most cases in future.

I would extend my appreciation sincerely to H.J. Lu for his help to refine this patch series.

* manual/tunables.texi (POSIX Thread Tunables): New node.
* nptl/Makefile (libpthread-routines): Add pthread_mutex_conf.
* nptl/nptl-init.c: Include pthread_mutex_conf.h.
(__pthread_initialize_minimal_internal) [HAVE_TUNABLES]: Call __pthread_tunables_init.
* nptl/pthreadP.h (MAX_ADAPTIVE_COUNT): Remove.
(max_adaptive_count): Define.
* nptl/pthread_mutex_conf.c: New file.
* nptl/pthread_mutex_conf.h: New file.
* sysdeps/generic/adaptive_spin_count.h: New file.
* sysdeps/nptl/dl-tunables.list: New file.
* nptl/pthread_mutex_lock.c (__pthread_mutex_lock): Use max_adaptive_count () not MAX_ADAPTIVE_COUNT.
* nptl/pthread_mutex_timedlock.c (__pthread_mutex_timedlock): Likewise.

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
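For context, the adaptive mutex type this spin count applies to is requested through standard pthreads calls; a minimal sketch follows (ordinary application code, not part of the patch).

    #define _GNU_SOURCE            /* for PTHREAD_MUTEX_ADAPTIVE_NP */
    #include <pthread.h>

    static pthread_mutex_t lock;

    /* An adaptive mutex spins briefly before sleeping on contention;
       glibc.pthread.mutex_spin_count bounds that spinning.  */
    static void
    init_adaptive_lock (void)
    {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init (&attr);
      pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
      pthread_mutex_init (&lock, &attr);
      pthread_mutexattr_destroy (&attr);
    }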
* Rename the glibc.tune namespace to glibc.cpu  (Siddhesh Poyarekar, 2018-08-02, 1 file, -20/+20)
The glibc.tune namespace is vaguely named since it is a 'tunable', so give it a more specific name that describes what it refers to. Rename the tunable namespace to 'cpu' to more accurately reflect what it encompasses. Also rename glibc.tune.cpu to glibc.cpu.name since glibc.cpu.cpu is weird.

* NEWS: Mention the change.
* elf/dl-tunables.list: Rename tune namespace to cpu.
* sysdeps/powerpc/dl-tunables.list: Likewise.
* sysdeps/x86/dl-tunables.list: Likewise.
* sysdeps/aarch64/dl-tunables.list: Rename tune.cpu to cpu.name.
* elf/dl-hwcaps.c (_dl_important_hwcaps): Adjust.
* elf/dl-hwcaps.h (GET_HWCAP_MASK): Likewise.
* manual/README.tunables: Likewise.
* manual/tunables.texi: Likewise.
* sysdeps/powerpc/cpu-features.c: Likewise.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.c (init_cpu_features): Likewise.
* sysdeps/x86/cpu-features.c: Likewise.
* sysdeps/x86/cpu-features.h: Likewise.
* sysdeps/x86/cpu-tunables.c: Likewise.
* sysdeps/x86_64/Makefile: Likewise.
* sysdeps/x86/dl-cet.c: Likewise.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
* x86/CET: Document glibc.tune.x86_ibt and glibc.tune.x86_shstk  (H.J. Lu, 2018-07-18, 1 file, -0/+28)
* manual/tunables.texi: Document glibc.tune.x86_ibt and glibc.tune.x86_shstk.
* manual: Fix spelling of "Auxiliary."  (Tobias Klauser, 2018-01-23, 1 file, -1/+1)
* powerpc: POWER8 memcpy optimization for cached memory  (Adhemerval Zanella, 2017-12-11, 1 file, -0/+10)
On POWER8, unaligned memory accesses to cached memory have little impact on performance, unlike on its predecessors. The optimization is disabled by default and will only be available when the tunable glibc.tune.cached_memopt is set to 1.

                   __memcpy_power8_cached     __memcpy_power7
  ============================================================
  max-size=4096:      33325.70 ( 12.65%)           38153.00
  max-size=8192:      32878.20 ( 11.17%)           37012.30
  max-size=16384:     33782.20 ( 11.61%)           38219.20
  max-size=32768:     33296.20 ( 11.30%)           37538.30
  max-size=65536:     33765.60 ( 10.53%)           37738.40

* manual/tunables.texi (Hardware Capability Tunables): Document glibc.tune.cached_memopt.
* sysdeps/powerpc/cpu-features.c: New file.
* sysdeps/powerpc/cpu-features.h: New file.
* sysdeps/powerpc/dl-procinfo.c [!IS_IN(ldconfig)]: Add _dl_powerpc_cpu_features.
* sysdeps/powerpc/dl-tunables.list: New file.
* sysdeps/powerpc/ldsodefs.h: Include cpu-features.h.
* sysdeps/powerpc/powerpc32/power4/multiarch/init-arch.h (INIT_ARCH): Initialize use_aligned_memopt.
* sysdeps/powerpc/powerpc64/dl-machine.h [defined(SHARED && IS_IN(rtld))]: Restrict dl_platform_init availability and initialize CPU features used by tunables.
* sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): Add memcpy-power8-cached.
* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Add __memcpy_power8_cached.
* sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise.
* sysdeps/powerpc/powerpc64/multiarch/memcpy-power8-cached.S: New file.

Reviewed-by: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
* Add elision tunables  (Rogerio Alves, 2017-12-05, 1 file, -0/+69)
This patch adds several new tunables to control the behavior of elision on supported platforms [1]. Since elision now depends on tunables, we should always *compile* with elision enabled, and leave the code disabled but available for runtime selection. This gives us *much* better compile-time testing of the existing code to avoid bit-rot [2].

Tested on ppc, ppc64, ppc64le, s390x and x86_64.

[1] This part of the patch was initially proposed by Paul Murphy but went stale because the framework has changed since the patch was originally proposed: https://patchwork.sourceware.org/patch/10342/
[2] This part of the patch was initially proposed as an RFC by Carlos O'Donell. It makes sense to me to integrate it into this patch: https://sourceware.org/ml/libc-alpha/2017-05/msg00335.html

* elf/dl-tunables.list: Add elision parameters.
* manual/tunables.texi: Add entries about elision tunable.
* sysdeps/unix/sysv/linux/powerpc/elision-conf.c: Add callback functions to dynamically enable/disable elision. Add multiple callback functions to set elision parameters. Deleted __libc_enable_secure check.
* sysdeps/unix/sysv/linux/s390/elision-conf.c: Likewise.
* sysdeps/unix/sysv/linux/x86/elision-conf.c: Likewise.
* configure: Regenerated.
* configure.ac: Option enable_lock_elision was deleted.
* config.h.in: ENABLE_LOCK_ELISION flag was deleted.
* config.make.in: Remove references to enable_lock_elision.
* manual/install.texi: Elision configure option was removed.
* INSTALL: Regenerated to remove enable_lock_elision.
* nptl/Makefile: Disable elision so it can verify the error case for destroying a mutex.
* sysdeps/powerpc/nptl/elide.h: Clean up ENABLE_LOCK_ELISION check. Deleted macros for the case when ENABLE_LOCK_ELISION was not defined.
* sysdeps/s390/configure: Regenerated.
* sysdeps/s390/configure.ac: Remove references to enable_lock_elision.
* nptl/tst-mutex8.c: Deleted all #ifndef ENABLE_LOCK_ELISION from the test.
* sysdeps/powerpc/powerpc32/sysdep.h: Deleted all ENABLE_LOCK_ELISION checks.
* sysdeps/powerpc/powerpc64/sysdep.h: Likewise.
* sysdeps/powerpc/sysdep.h: Likewise.
* sysdeps/s390/nptl/bits/pthreadtypes-arch.h: Likewise.
* sysdeps/unix/sysv/linux/powerpc/force-elision.h: Likewise.
* sysdeps/unix/sysv/linux/s390/elision-conf.h: Likewise.
* sysdeps/unix/sysv/linux/s390/force-elision.h: Likewise.
* sysdeps/unix/sysv/linux/s390/lowlevellock.h: Likewise.
* sysdeps/unix/sysv/linux/s390/Makefile: Remove references to enable-lock-elision.

Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
* Add thunderx2t99 and thunderx2t99p1 CPU names to tunables list  (Steve Ellcey, 2017-09-08, 1 file, -1/+2)
* manual/tunables.texi (glibc.tune.cpu): Add thunderx2t99 and thunderx2t99p1 to list of cpu names.
* sysdeps/unix/sysv/linux/aarch64/cpu-features.c (cpu_list): Add thunderx2t99 and thunderx2t99p1 entries to cpu_list.
* malloc: Abort on heap corruption, without a backtrace [BZ #21754]  (Florian Weimer, 2017-08-30, 1 file, -21/+7)
The stack trace printing caused deadlocks and has itself been targeted by code execution exploits.