path: root/src/locale
Commit message | Author | Age | Files | Lines
* use libc-internal malloc for newlocale/freelocale | Rich Felker | 2020-12-09 | 2 | -0/+10
  this is necessary for MT-fork correctness now that the code runs under locale lock. it would not be hard to avoid, but __get_locale is already using libc-internal malloc anyway. this can be reconsidered during locale overhaul later if needed.
* drop use of pthread_once in newlocale | Rich Felker | 2020-12-09 | 1 | -9/+7
  in general, pthread_once is not compatible with MT-fork constraints (commit 167390f05564e0a4d3fcb4329377fd7743267560). here it actually no longer matters, because it's now called with a lock held, but since the lock is held it's pointless to use pthread_once.
* lift locale lock out of internal __get_locale | Rich Felker | 2020-12-09 | 3 | -18/+17
  this allows the lock to be shared with setlocale, eliminates repeated per-category lock/unlock in newlocale, and will allow the use of pthread_once in newlocale to be dropped (to be done separately).
* lift child restrictions after multi-threaded fork | Rich Felker | 2020-11-11 | 2 | -2/+8
  as the outcome of Austin Group tracker issue #62, future editions of POSIX have dropped the requirement that fork be AS-safe. this allows but does not require implementations to synchronize fork with internal locks and give forked children of multithreaded parents a partly or fully unrestricted execution environment where they can continue to use the standard library (per POSIX, they can only portably use AS-safe functions).

  up until recently, taking this allowance did not seem desirable. however, commit 8ed2bd8bfcb4ea6448afb55a941f4b5b2b0398c0 exposed the extent to which applications and libraries are depending on the ability to use malloc and other non-AS-safe interfaces in MT-forked children, by converting latent very-low-probability catastrophic state corruption into predictable deadlock. dealing with the fallout has been a huge burden for users/distros.

  while it looks like most of the non-portable usage in applications could be fixed given sufficient effort, at least some of it seems to occur in language runtimes which are exposing the ability to run unrestricted code in the child as part of the contract with the programmer. any attempt at fixing such contracts is not just a technical problem but a social one, and is probably not tractable.

  this patch extends the fork function to take locks for all libc singletons in the parent, and release or reset those locks in the child, so that when the underlying fork operation takes place, the state protected by these locks is consistent and ready for the child to use. locking is skipped in the case where the parent is single-threaded so as not to interfere with legacy AS-safety property of fork in single-threaded programs.

  lock order is mostly arbitrary, but the malloc locks (including bump allocator in case it's used) must be taken after the locks on any subsystems that might use malloc, and non-AS-safe locks cannot be taken while the thread list lock is held, imposing a requirement that it be taken last.
* convert malloc use under libc-internal locks to use internal allocator | Rich Felker | 2020-11-11 | 2 | -0/+11
  this change lifts undocumented restrictions on calls by replacement mallocs to libc functions that might take these locks, and sets the stage for lifting restrictions on the child execution environment after multithreaded fork.

  care is taken to #define macros to replace all four functions (malloc, calloc, realloc, free) even if not all of them will be used, using an undefined symbol name for the ones intended not to be used so that any inadvertent future use will be caught at compile time rather than directed to the wrong implementation.
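The macro redirection described above might look roughly like the following sketch. The names `internal_malloc`, `internal_free`, `undef_calloc` and `undef_realloc` are hypothetical placeholders chosen for illustration, not the identifiers used in the actual source, and the "internal allocator" here is only a counting wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical internal allocator: a counting wrapper around the public
 * one. It is defined before the macros below, so the calls inside it still
 * refer to the real malloc and free. */
static size_t internal_allocs;

static void *internal_malloc(size_t n)
{
	internal_allocs++;
	return malloc(n);
}

static void internal_free(void *p)
{
	free(p);
}

/* Redirect all four names. The two that this file does not intend to use
 * map to names that are declared nowhere, so an accidental call fails to
 * build instead of silently reaching the public allocator. */
#define malloc  internal_malloc
#define free    internal_free
#define calloc  undef_calloc
#define realloc undef_realloc

/* From here on, plain-looking calls go to the internal allocator. */
static int make_and_drop(size_t n)
{
	assert(n > 0);
	char *p = malloc(n);
	if (!p) return -1;
	p[0] = 0;
	free(p);
	return 0;
}
```

Defining all four macros, including the unused ones, is what turns a later careless `calloc` call in the same file into a build failure.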
* fix MUSL_LOCPATH search | Rich Felker | 2020-08-22 | 1 | -1/+1
  all path elements but the last had the final byte truncated.
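For illustration, a colon-separated search path can be walked so that every element keeps its full length. This is a minimal sketch with hypothetical helper names (`path_elem`, `nth_elem`); it is not the code that was fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the path element starting at p, and store the start
 * of the next element (or NULL after the last one) in *next. */
static size_t path_elem(const char *p, const char **next)
{
	assert(p && next);
	size_t len = strcspn(p, ":");   /* bytes before ':' or the terminator */
	*next = p[len] == ':' ? p + len + 1 : NULL;
	return len;                     /* the full element, nothing dropped */
}

/* Copy element number idx (0-based) of a colon-separated list into buf.
 * Returns its length, or -1 if idx is out of range or buf is too small. */
static int nth_elem(const char *path, int idx, char *buf, size_t bufsz)
{
	const char *p = path;
	while (p) {
		const char *next;
		size_t len = path_elem(p, &next);
		if (idx-- == 0) {
			if (len >= bufsz) return -1;
			memcpy(buf, p, len);
			buf[len] = 0;
			return (int)len;
		}
		p = next;
	}
	return -1;
}
```

An off-by-one in the length computation for non-final elements is exactly the kind of error that would drop the last byte of every element except the final one.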
* fix accidentally-external cmp symbol introduced with catgets | Rich Felker | 2019-08-13 | 1 | -1/+1
  commit 7590203c486d9002522019045d34ee3dee0a66f5 omitted static here.
* add non-stub implementation of catgets localization functions | Rich Felker | 2019-08-07 | 3 | -3/+114
  these accept the netbsd/openbsd message catalog file format, consisting of a sorted list of set headers and a sorted list of message headers for each set, admitting trivial binary search for lookups.

  the gnu format was not chosen because it's unusably bad. it does not admit efficient (log time or better) lookups; rather, it requires linear search or hash table lookups, and the hash function is awful: it's literally set_id*msg_id.
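The two-level binary search that sorted headers make possible can be sketched as below. The `struct msg` and `struct set` layouts are hypothetical in-memory stand-ins, not the on-disk catalog format, and `cat_lookup` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical in-memory layout; both arrays are sorted by id. */
struct msg { int id; const char *text; };
struct set { int id; const struct msg *msgs; size_t nmsgs; };

/* Comparison callbacks are static so they stay private to this file. */
static int cmp_set(const void *key, const void *elem)
{
	int k = *(const int *)key, e = ((const struct set *)elem)->id;
	return (k > e) - (k < e);
}

static int cmp_msg(const void *key, const void *elem)
{
	int k = *(const int *)key, e = ((const struct msg *)elem)->id;
	return (k > e) - (k < e);
}

/* Find the set, then the message within it; both steps are O(log n).
 * Returns dflt when either id is absent. */
static const char *cat_lookup(const struct set *sets, size_t nsets,
                              int set_id, int msg_id, const char *dflt)
{
	assert(sets || nsets == 0);
	const struct set *s = bsearch(&set_id, sets, nsets, sizeof *sets, cmp_set);
	if (!s) return dflt;
	const struct msg *m = bsearch(&msg_id, s->msgs, s->nmsgs,
	                              sizeof *s->msgs, cmp_msg);
	return m ? m->text : dflt;
}
```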
* locale: ensure dcngettext() preserves errno | A. Wilcox | 2019-02-07 | 1 | -0/+3
  Some packages call gettext to format a message to be sent to perror. If the currently set user locale points to a non-existent .mo file, open via __map_file in dcngettext will set errno to ENOENT.

  Maintainer's notes: Non-modification of errno is a documented part of the interface contract for the GNU version of this function and likely other versions. The issue being fixed here seems to be a regression from commit 1b52863e244ecee5b5935b6d36bb9e6efe84c035, which enabled setting of errno from __map_file.
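The usual way to keep such a function from disturbing errno is to save it on entry and restore it on exit. The sketch below assumes a hypothetical `lookup_preserving_errno` that merely probes for a file; it illustrates the pattern only and is not the dcngettext code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical lookup: tries to open a catalog file and falls back to the
 * untranslated string. errno is restored so the caller's value survives
 * a failed open. */
static const char *lookup_preserving_errno(const char *path, const char *msgid)
{
	assert(path && msgid);
	int saved = errno;
	FILE *f = fopen(path, "rb");   /* may set errno, for example to ENOENT */
	if (f) fclose(f);              /* a real lookup would read the file here */
	errno = saved;
	return msgid;
}
```

With this in place, a caller that sets errno, formats a message through the lookup, and then calls perror still reports its own error rather than the failed catalog open.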
* fix regression in setlocale for LC_ALL with per-category setting | Rich Felker | 2018-11-02 | 1 | -1/+1
  commit d88e5dfa8b989dafff4b748bfb3cba3512c8482e inadvertently changed the argument passed to __get_locale from part (the current ;-delimited component) to name (the full string).
* make the default locale (& a variant) failure-free cases for newlocale | Rich Felker | 2018-10-22 | 1 | -1/+20
  commit aeeac9ca5490d7d90fe061ab72da446c01ddf746 introduced fail-safe invariants that creating a locale_t object for the C locale or C.UTF-8 locale will always succeed. extend the guarantee to also cover the following:

  - newlocale(LC_ALL_MASK, "", 0)
  - newlocale(LC_ALL_MASK-LC_CTYPE_MASK, "C", 0)

  provided that the LANG/LC_* environment variables have not been changed by the program. these usages are idiomatic for getting the default locale, and for getting a locale that behaves as the C locale except for honoring the default locale's character encoding.
* simplify newlocale and allow failure for explicit locale names | Rich Felker | 2018-10-22 | 1 | -23/+14
  unify the code paths for allocated and non-allocated locale objects, always using a tmp object. this is necessary to avoid clobbering the base locale object too soon if we allow for the possibility that looking up an explicitly requested locale name may fail, and makes the code simpler and cleaner anyway.

  eliminate the complex and fragile logic for checking whether one of the non-allocated locale objects can be used for the result, and instead just memcmp against each of them.
* adapt setlocale to support possibility of failure | Rich Felker | 2018-10-20 | 1 | -12/+20
  introduce a new LOC_MAP_FAILED sentinel for errors, since null pointers for a category's locale map indicate the C locale. at this time, __get_locale does not fail, so there should be no functional change by this commit.
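When a null pointer already carries a meaning, a distinct sentinel value is needed to report failure. The sketch below follows the same idea as `MAP_FAILED` from mmap; `struct locale_map_sketch`, `MAP_FAILED_SKETCH`, `find_map` and the locale name `xx_YY` are hypothetical and only illustrate the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-category locale map. */
struct locale_map_sketch { const char *name; };

/* Sentinel distinct from NULL (which means "C locale") and from any
 * real object. */
#define MAP_FAILED_SKETCH ((const struct locale_map_sketch *)-1)

static const struct locale_map_sketch known = { "xx_YY" };

/* Three outcomes: NULL for the C locale, a real map, or the sentinel. */
static const struct locale_map_sketch *find_map(const char *name)
{
	assert(name);
	if (!strcmp(name, "C") || !strcmp(name, "POSIX")) return NULL;
	if (!strcmp(name, known.name)) return &known;
	return MAP_FAILED_SKETCH;
}
```

Callers must compare against the sentinel before treating the result as either "C locale" or a usable pointer.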
* drop lazy plural forms init in dcngettext | Rich Felker | 2018-09-14 | 1 | -18/+17
  there is no good reason to wait to find and process the plural rules for a translated message file until a gettext form requesting plural rule processing is used. it just imposes additional synchronization, here in the form of clunky use of atomics.

  it looks like there may also have been a race condition where nplurals could be seen without plural_rule being seen, possibly leading to null pointer dereference. if so, this commit fixes it.
* split internal lock API out of libc.h, creating lock.h | Rich Felker | 2018-09-12 | 3 | -2/+3
  this further reduces the number of source files which need to include libc.h and thereby be potentially exposed to libc global state and internals.

  this will also facilitate further improvements like adding an inline fast-path, if we want to do so later.
* reduce spurious inclusion of libc.h | Rich Felker | 2018-09-12 | 10 | -11/+0
  libc.h was intended to be a header for access to global libc state and related interfaces, but ended up included all over the place because it was the way to get the weak_alias macro. most of the inclusions removed here are places where weak_alias was needed. a few were recently introduced for hidden. some go all the way back to when libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented) cancellation points had to include it.

  remaining spurious users are mostly callers of the LOCK/UNLOCK macros and files that use the LFS64 macro to define the awful *64 aliases.

  in a few places, new inclusion of libc.h is added because several internal headers no longer implicitly include libc.h.

  declarations for __lockfile and __unlockfile are moved from libc.h to stdio_impl.h so that the latter does not need libc.h. putting them in libc.h made no sense at all, since the macros in stdio_impl.h are needed to use them correctly anyway.
* apply hidden visibility to various remaining internal interfaces | Rich Felker | 2018-09-12 | 1 | -1/+3
* overhaul internally-public declarations using wrapper headers | Rich Felker | 2018-09-12 | 3 | -9/+2
  commits leading up to this one have moved the vast majority of libc-internal interface declarations to appropriate internal headers, allowing them to be type-checked and setting the stage to limit their visibility. the ones that have not yet been moved are mostly namespace-protected aliases for standard/public interfaces, which exist to facilitate implementing plain C functions in terms of POSIX functionality, or C or POSIX functionality in terms of extensions that are not standardized. some don't quite fit this description, but are "internally public" interfaces between subsystems of libc.

  rather than create a number of newly-named headers to declare these functions, and having to add explicit include directives for them to every source file where they're needed, I have introduced a method of wrapping the corresponding public headers.

  parallel to the public headers in $(srcdir)/include, we now have wrappers in $(srcdir)/src/include that come earlier in the include path order. they include the public header they're wrapping, then add declarations for namespace-protected versions of the same interfaces and any "internally public" interfaces for the subsystem they correspond to.

  along these lines, the wrapper for features.h is now responsible for the definition of the hidden, weak, and weak_alias macros. this means source files will no longer need to include any special headers to access these features.

  over time, it is my expectation that the scope of what is "internally public" will expand, reducing the number of source files which need to include *_impl.h and related headers down to those which are actually implementing the corresponding subsystems, not just using them.
* add internal header for declaring __pleval function (used by gettext) | Rich Felker | 2018-09-12 | 3 | -1/+8
  locale_impl.h could have been used, but this function is completely independent of anything else, and preserving that property seems nice.
* move __loc_is_allocated declaration to locale_impl.h | Rich Felker | 2018-09-12 | 1 | -2/+0
* fix output size handling for multi-unicode-char big5-hkscs characters | Rich Felker | 2018-06-01 | 1 | -5/+13
  since this iconv implementation's output is stateless, it's necessary to know before writing anything to the output buffer whether the conversion of the current input character will fit. previously we used a hard-coded table of the output size needed for each supported output encoding, but failed to update the table when adding support for conversion to jis-based encodings and again when adding separate encoding identifiers for implicit-endianness utf-16/32 and ucs-2/4 variants, resulting in out-of-bound table reads and incorrect size checks. no buffer overflow was possible, but the affected characters could be converted incorrectly, and iconv could potentially produce an incorrect return value as a result.

  remove the hard-coded table, and instead perform the recursive iconv conversion to a temporary buffer, measuring the output size and transferring it to the actual output buffer only if the whole converted result fits.
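The "convert to a temporary buffer, then copy only if everything fits" approach can be sketched as follows. This sketch uses a small hand-written UTF-8 encoder in place of the recursive iconv call the commit describes, and the names `utf8_encode` and `emit_all_or_nothing` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode one code point as UTF-8 into tmp (at least 4 bytes) and return the
 * length. Assumes c is a valid Unicode scalar value. */
static size_t utf8_encode(unsigned c, unsigned char *tmp)
{
	if (c < 0x80) {
		tmp[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		tmp[0] = (unsigned char)(0xc0 | (c >> 6));
		tmp[1] = (unsigned char)(0x80 | (c & 0x3f));
		return 2;
	}
	if (c < 0x10000) {
		tmp[0] = (unsigned char)(0xe0 | (c >> 12));
		tmp[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
		tmp[2] = (unsigned char)(0x80 | (c & 0x3f));
		return 3;
	}
	tmp[0] = (unsigned char)(0xf0 | (c >> 18));
	tmp[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
	tmp[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
	tmp[3] = (unsigned char)(0x80 | (c & 0x3f));
	return 4;
}

/* Emit up to two code points as one unit: either all bytes are written to
 * *out and the cursor advances, or nothing is written and -1 is returned. */
static int emit_all_or_nothing(const unsigned *cps, size_t n,
                               unsigned char **out, size_t *outleft)
{
	unsigned char tmp[8];          /* room for two 4-byte sequences */
	size_t len = 0;
	assert(n <= 2);
	for (size_t i = 0; i < n; i++)
		len += utf8_encode(cps[i], tmp + len);
	if (len > *outleft) return -1; /* measured size does not fit */
	memcpy(*out, tmp, len);
	*out += len;
	*outleft -= len;
	return 0;
}
```

Measuring the real output avoids maintaining a per-encoding size table that can fall out of date when encodings are added.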
* fix iconv mapping of big5-hkscs characters that map to two unicode chars | Rich Felker | 2018-06-01 | 1 | -1/+1
  this case is handled with a recursive call to iconv using a specially-constructed conversion descriptor. the constant 0 was used as the offset for utf-8, since utf-8 appears first in the charmaps table, but the offset used needs to point into the charmap entry, past the name/aliases at the beginning, to the byte identifying the encoding. as a result of this error, junk was produced.

  instead, call find_charmap so we don't have to hard-code a nontrivial offset. with this change, the code has been tested and found to work in the case of converting the affected hkscs characters to utf-8.
* fix iconv conversion to UTF-32 with implicit (big) endianness | Will Dietz | 2018-05-09 | 1 | -0/+2
  maintainer's notes:

  commit 95c6044e2ae85846330814c4ac5ebf4102dbe02c split UTF-32 and UTF-32BE but neglected to add a case for the former as a destination encoding, resulting in it wrongly being handled by the default case. the intent was that the value of the macro be chosen to encode "big endian" in the low bits, so that no code would be needed, but this was botched; instead, handle it the way UCS2 is handled.
* fix iconv buffer overflow converting to legacy JIS-based encodings | Will Dietz | 2018-05-09 | 1 | -0/+1
  maintainer's notes:

  commit a223dbd27ae36fe53f9f67f86caf685b729593fc added the reverse conversions to JIS-based encodings, but omitted the check for remaining buffer space in the case where the next character to be written was single-byte, allowing conversion to continue past the end of the destination buffer.
* fix nl_langinfo_l(CODESET, loc) reporting wrong locale's value | Rich Felker | 2018-03-07 | 1 | -1/+1
  use of MB_CUR_MAX encoded a hidden dependency on the currently active locale for the calling thread, whereas nl_langinfo_l is supposed to report for the locale passed as an argument.
* revise the definition of multiple basic locks in the code | Jens Gustedt | 2018-01-09 | 3 | -3/+3
  In all cases this is just a change from two volatile int to one.
* fix iconv output of surrogate pairs in ucs2 | Rich Felker | 2017-12-18 | 1 | -1/+1
  in the unified code for handling utf-16 and ucs2 output, the check for ucs2 wrongly looked at the source charset rather than the destination charset.
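To see why the destination charset is the one that matters, consider a shared encoder for 16-bit output. In this sketch (the function name `encode_16` and the flag `dest_is_ucs2` are hypothetical), code points above U+FFFF become a surrogate pair for UTF-16 but are unrepresentable in UCS-2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point c into 16-bit units for the destination encoding.
 * Returns the number of units written (1 or 2), or 0 if c cannot be
 * represented. Only the destination's properties are consulted. */
static size_t encode_16(uint32_t c, int dest_is_ucs2, uint16_t out[2])
{
	assert(out);
	if (c >= 0xd800 && c < 0xe000) return 0;   /* lone surrogate */
	if (c > 0x10ffff) return 0;                /* beyond Unicode */
	if (c < 0x10000) {
		out[0] = (uint16_t)c;
		return 1;
	}
	if (dest_is_ucs2) return 0;                /* UCS-2 has no surrogate pairs */
	c -= 0x10000;
	out[0] = (uint16_t)(0xd800 | (c >> 10));   /* high surrogate */
	out[1] = (uint16_t)(0xdc00 | (c & 0x3ff)); /* low surrogate */
	return 2;
}
```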
* add support for BOM-determined-endian UCS2, UTF-16, and UTF-32 to iconv | Rich Felker | 2017-12-18 | 1 | -3/+40
  previously, the charset names without endianness specified were always interpreted as big endian. unicode specifies that UTF-16 and UTF-32 have BOM-determined endianness if BOM is present, and are otherwise big endian. since commit 5b546faa67544af395d6407553762b37e9711157 added support for stateful encodings, it is now possible to implement BOM support via the conversion descriptor state.

  for conversions to these charsets, the output is always big endian and does not have a BOM.
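The BOM rule for UTF-16 input can be sketched as below: a leading FE FF or FF FE selects the byte order, and input without a BOM is treated as big endian. The names `utf16_detect`, `get_16`, `ORDER_BE` and `ORDER_LE` are hypothetical, and this standalone helper does not model the conversion descriptor state mentioned above.

```c
#include <assert.h>
#include <stddef.h>

enum { ORDER_BE = 0, ORDER_LE = 1 };

/* Determine byte order from an optional BOM at the start of the input.
 * *skip receives the number of bytes the BOM occupies (0 or 2). */
static int utf16_detect(const unsigned char *in, size_t n, size_t *skip)
{
	assert(in && skip);
	*skip = 0;
	if (n >= 2) {
		if (in[0] == 0xfe && in[1] == 0xff) { *skip = 2; return ORDER_BE; }
		if (in[0] == 0xff && in[1] == 0xfe) { *skip = 2; return ORDER_LE; }
	}
	return ORDER_BE;   /* no BOM: big endian */
}

/* Read one 16-bit unit in the given byte order. */
static unsigned get_16(const unsigned char *p, int order)
{
	return order == ORDER_LE
		? ((unsigned)p[1] << 8) | p[0]
		: ((unsigned)p[0] << 8) | p[1];
}
```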
* add cp866 (dos cyrillic) to iconv | Rich Felker | 2017-12-18 | 1 | -0/+12
* add ibm1047 codepage (ebcdic representation of latin1) to iconv | Rich Felker | 2017-12-12 | 1 | -0/+20
* add reverse iconv mappings for JIS-based encodings | Rich Felker | 2017-11-14 | 2 | -1/+612
  these encodings are still commonly used in messaging protocols and such.

  the reverse mapping is implemented as a binary search of a list of the jis 0208 characters in unicode order; the existing forward table is used to perform the comparison in the search.
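A reverse lookup of this shape stores only the legacy codes, ordered by the Unicode value they map to, and compares through the forward table during the search. The tiny tables below (`fwd`, `rev`) are invented for illustration and are unrelated to the real JIS X 0208 data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical forward table: legacy code (index) -> Unicode code point. */
static const unsigned short fwd[] = { 0x3042, 0x00e9, 0x4e00, 0x0416 };

/* Legacy codes listed in increasing order of fwd[code]. No Unicode values
 * are stored here; the forward table supplies them. */
static const unsigned char rev[] = { 1, 3, 0, 2 };

/* Return the legacy code for code point c, or -1 if it is not mapped. */
static int rev_lookup(unsigned c)
{
	size_t lo = 0, hi = sizeof rev / sizeof *rev;
	assert(sizeof rev / sizeof *rev == sizeof fwd / sizeof *fwd);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		unsigned u = fwd[rev[mid]];   /* compare via the forward table */
		if (u == c) return rev[mid];
		if (u < c) lo = mid + 1;
		else hi = mid;
	}
	return -1;
}
```

The design keeps the reverse table small, since each entry is just an index, at the cost of one extra table read per comparison.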
* generalize iconv framework for 8-bit codepages | Rich Felker | 2017-11-13 | 3 | -246/+273
  previously, 8-bit codepages could only remap the high 128 bytes; the low range was assumed/forced to agree with ascii. interpretation of codepage table headers has been changed so that it's possible to represent mappings for up to 256 slots (fewer if the initial portion of the map is elided because it coincides with unicode codepoints).

  this requires consuming a bit more of the 10-bit space of characters that can be represented in 8-bit codepages, but there's still plenty left. the size of the legacy_chars table is actually reduced now by eliding the first 256 entries and considering them to map implicitly via the identity map.

  before these changes, there seem to have been minor bugs/omissions in codepage table generation, so it's likely that some actual bug fixes are silently included in this commit. round-trip testing of a few codepages was performed on the new version of the code, but no differential testing against the old version was done.
* reformat cjk iconv tables to be diff-friendly, match tool output | Rich Felker | 2017-11-11 | 3 | -2755/+2808
  the new version of the code used to generate these tables forces a newline every 256 entries, whereas at the time these files were originally generated and committed, it only wrapped them at 80 columns. the new behavior ensures that localized changes to the tables, if they are ever needed, will produce localized diffs.

  other tables including hkscs were already committed in the new format.

  binary comparison of the generated object files was performed to confirm that no spurious changes slipped in.
* add iso-2022-jp support (decoding only) to iconv | Rich Felker | 2017-11-10 | 1 | -2/+45
  this implementation aims to match the baseline defined by rfc1468 (the original mime charset definition) plus the halfwidth katakana extension included in the whatwg definition of the charset. rejection of si/so controls and newlines in doublebyte state are not currently enforced. the jis x 0201 mode is currently interpreted as having the yen sign and overline character in place of backslash and tilde; ascii mode has the standard ascii characters in those slots.
* add iconv framework for decoding stateful encodings | Rich Felker | 2017-11-10 | 2 | -3/+24
  assuming pointers obtained from malloc have some nonzero alignment, repurpose the low bit of iconv_t as an indicator that the descriptor is a stateless value representing the source and destination character encodings.
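The low-bit tagging idea can be sketched like this. The type `cd_t`, the helper names, and the exact bit layout (source id in the high bits, destination id in bits 1 to 15) are assumptions made for the example, not a description of the real descriptor encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef void *cd_t;   /* hypothetical stand-in for iconv_t */

/* Pack two small encoding ids into a pointer-sized value with the low bit
 * set. Pointers from malloc are at least 2-byte aligned, so their low bit
 * is 0 and the two kinds of descriptor cannot be confused. */
static cd_t cd_make_stateless(unsigned from, unsigned to)
{
	assert(from < 0x8000 && to < 0x8000);
	return (cd_t)(((uintptr_t)from << 16) | ((uintptr_t)to << 1) | 1);
}

static int cd_is_stateless(cd_t cd)
{
	return (int)((uintptr_t)cd & 1);
}

static unsigned cd_from(cd_t cd)
{
	return (unsigned)((uintptr_t)cd >> 16);
}

static unsigned cd_to(cd_t cd)
{
	return (unsigned)(((uintptr_t)cd >> 1) & 0x7fff);
}
```

Stateless descriptors then need no allocation at all, and only stateful ones have to be freed when the descriptor is closed.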
* simplify/optimize iconv utf-8 case | Rich Felker | 2017-11-10 | 1 | -4/+3
  the special case where mbrtowc returns 0 but consumed 1 byte of input does not need to be considered, because the short-circuit for low bytes already covered that case.
* handle ascii range individually in each iconv case | Rich Felker | 2017-11-10 | 1 | -2/+10
  short-circuiting low bytes before the switch precluded support for character encodings that don't coincide with ascii in this range. this limitation affected iso-2022 encodings, which use the esc byte to introduce a shift sequence, and things like ebcdic.
* move iconv_close to its own translation unit | Rich Felker | 2017-11-10 | 2 | -5/+6
  this is in preparation to support stateful conversion descriptors, which are necessarily allocated and thus must be freed in iconv_close. putting it in a separate TU will avoid pulling in free if iconv_close is not referenced.
* refactor iconv conversion descriptor encoding/decoding | Rich Felker | 2017-11-10 | 1 | -6/+20
  this change is made to avoid having assumptions about the encoding spread out across the file, and to facilitate future change to a form that can accommodate allocated, stateful descriptors when needed.

  this commit should not produce any functional changes; with the compiler tested the only change to code generation was minor reordering of local variables on stack.
* add _NL_LOCALE_NAME extension to nl_langinfo | Rich Felker | 2017-07-31 | 1 | -0/+4
  since setlocale(cat, NULL) is required to return the setting for the global locale, there is no standard mechanism to obtain the name of the currently active thread-local locale set by uselocale. this makes it impossible for application/library software to load appropriate translations, etc. unless using the gettext implementation provided by libc, which has privileged access to libc internals.

  to fill this gap, glibc introduced the _NL_LOCALE_NAME macro which can be used with nl_langinfo to obtain the name. GNU gettext/gnulib code already uses this functionality on glibc, and can easily be adapted to make use of it on non-glibc systems if it's available; for other systems they poke at locale implementation internals, which we want to avoid. this patch provides a compatible interface to the one glibc introduced.
* fix missing volatile qualifier on lock in __get_locale | Jens Gustedt | 2017-07-04 | 1 | -1/+1
* fix iconv conversions for iso88592-iso885916 | Bartosz Brachaczek | 2017-06-20 | 1 | -1/+1
  commit 97bd6b09dbe7478d5a90a06ecd9e5b59389d8eb9 refactored the table lookup into a function and introduced an error in index computation. the error caused garbage to be read from the table if the given charmap had a non-zero number of elided entries.
* catopen: set errno to EOPNOTSUPP | A. Wilcox | 2017-06-14 | 1 | -0/+2
  Per 1003.1-2008 (2016 ed.), catopen must set errno on failure. We set errno to EOPNOTSUPP because musl does not currently support message catalogues.
* fix iconv conversions to legacy 8bit encodings | Rich Felker | 2017-05-27 | 1 | -9/+12
  there was missing reverse-conversion logic for the case, handled specially in the character set tables, where a byte represents a unicode codepoint with the same value.

  this patch adds code to handle the case, and refactors the two-level 10-bit table lookup for legacy character sets into a function to avoid repeating it yet another time as part of the fix.
* search locale name variants for gettext translations | Rich Felker | 2017-03-21 | 1 | -32/+55
  often translations will be named only by language, whereas locale names may also include a territory code, modifier, and codeset portion. previously, only translations exactly matching the locale name were loaded. this was a major usability issue, requiring workarounds like symlinks or tweaking of the locale name.

  with these changes, gettext now searches for translations by first removing the codeset portion of the locale name, then trying the remainder in full, with modifier (@mod) removed, with territory code (_XX) removed, and with both removed.

  part of the reason gettext lacked support for searching fallbacks before is that the candidate pathname for a translation file was constructed on each call and used as the key to lookup an already-mapped translation file. this was very costly/inefficient. we now use the tuple of textdomain binding pointer, locale map pointer, and integer category id as the key for looking up a translation file mapping.

  based on patch by He X.
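The search order described above can be sketched as a function that produces the four candidate names in turn. The name `locale_variant` and its interface are hypothetical; the sketch only shows how the candidates are derived from a name such as `de_DE.UTF-8@euro`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write candidate number i (0..3) for a locale name into out. The codeset
 * (".UTF-8") is always dropped. Bit 0 of i drops the modifier ("@euro") and
 * bit 1 drops the territory ("_DE"), giving the order: full, no modifier,
 * no territory, neither. Returns 0 on success, -1 if out is too small. */
static int locale_variant(const char *name, int i, char *out, size_t outsz)
{
	assert(name && out && i >= 0 && i < 4);
	size_t langlen = strcspn(name, "_.@");                    /* "de" */
	const char *terr = name + langlen;
	size_t terrlen = *terr == '_' ? strcspn(terr, ".@") : 0;  /* "_DE" */
	const char *mod = strchr(name, '@');                      /* "@euro" */
	size_t modlen = mod ? strlen(mod) : 0;

	if (i & 1) modlen = 0;
	if (i & 2) terrlen = 0;
	if (langlen + terrlen + modlen >= outsz) return -1;

	memcpy(out, name, langlen);
	if (terrlen) memcpy(out + langlen, terr, terrlen);
	if (modlen) memcpy(out + langlen + terrlen, mod, modlen);
	out[langlen + terrlen + modlen] = 0;
	return 0;
}
```

For names that lack a territory or modifier, some candidates coincide; a caller could skip repeats or simply tolerate the redundant lookups.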
* make setlocale return a single name for LC_ALL if all categories match | Rich Felker | 2017-03-21 | 1 | -2/+5
  when called for LC_ALL, setlocale has to return a string representing the state of all locale categories. the simplest way to do this was to always return a delimited list of values for each category, but that's not friendly in the fairly common case where all categories have the same setting. He X proposed a patch to check for this case and return a single name; this patch is a simplified approach to do the same.
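The behavior can be sketched as a function that builds the LC_ALL string from per-category names: one name when they all agree, otherwise a `;`-delimited list. The function name `compose_all` and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build the LC_ALL result from n per-category names. Returns 0 on success,
 * or -1 if buf is too small (the size check is slightly conservative). */
static int compose_all(const char *const names[], size_t n,
                       char *buf, size_t bufsz)
{
	assert(names && buf && n > 0);
	int same = 1;
	for (size_t i = 1; i < n; i++)
		if (strcmp(names[0], names[i])) same = 0;

	size_t used = 0;
	size_t count = same ? 1 : n;   /* a single name if all categories match */
	for (size_t i = 0; i < count; i++) {
		size_t len = strlen(names[i]);
		if (used + len + 2 > bufsz) return -1;   /* room for ';' and NUL */
		if (i) buf[used++] = ';';
		memcpy(buf + used, names[i], len);
		used += len;
	}
	buf[used] = 0;
	return 0;
}
```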
* avoid unbounded strlen in gettext functions | Rich Felker | 2017-01-29 | 1 | -3/+3
  use the standard strnlen idiom for cases where lengths greater than an imposed limit are going to be rejected immediately anyway.
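The idiom in question bounds the scan at one byte past the limit, so an over-long or unterminated string is rejected without being walked to its end. In this sketch, `NAME_LIMIT` and `bounded_len` are hypothetical names, and `strnlen` is the POSIX.1-2008 function.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for strnlen */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LIMIT 255   /* hypothetical cap, standing in for the real limit */

/* Return the length of s if it is at most NAME_LIMIT, otherwise -1.
 * At most NAME_LIMIT+1 bytes of s are examined. */
static long bounded_len(const char *s)
{
	assert(s);
	size_t n = strnlen(s, NAME_LIMIT + 1);
	return n > NAME_LIMIT ? -1 : (long)n;
}
```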
* fix use of uninitialized pointer in gettext core | Rich Felker | 2017-01-29 | 1 | -2/+2
  the plural_rule field of allocated msgcat structures was assumed to be initially-null but was never initialized. for future-proofing, the nplurals field which was left uninitialized should also be cleared.

  likewise, in the binding structure, the active field could be used uninitialized by a technicality: the a_store which stores the initial value of 0 may be implemented as a cas operation, which reads the old value.

  rather than fixing these issues individually, just use calloc for both allocations. this does result in wasteful clearing of name buffers (up to NAME_MAX+PATH_MAX) before filling them, but since the size is bounded and the time is dominated by filesystem operations, it really doesn't matter; simplicity and future-proofing have more value here.

  modified from patch submitted by He X.
* fix bindtextdomain logic error deactivating other domains | Rich Felker | 2017-01-29 | 1 | -1/+1
  this loop was only supposed to deactivate other bindings for the same text domain name, but due to copy-and-paste error, deactivated all other bindings.

  patch by He X.
* fix return value of nl_langinfo for invalid item arguments | Rich Felker | 2015-11-10 | 1 | -5/+5
  it was wrongly returning a null pointer instead of an empty string.