overhaul wide character case mapping implementation - mirror/musl

diff options

author	Rich Felker <dalias@aerifal.cx>	2019-10-25 12:33:17 -0400
committer	Rich Felker <dalias@aerifal.cx>	2019-10-25 12:39:40 -0400
commit	a11a6246c64b186c5547b9527a768a56f3d6b281 (patch)
tree	f651892a6075900041d0b29abd9ac12e8a60c96a /src/thread/microblaze/__set_thread_area.s
parent	e8aba58ab19a18f83d7f78e80d5e4f51e7e4e8a9 (diff)
download	musl-a11a6246c64b186c5547b9527a768a56f3d6b281.tar.gz musl-a11a6246c64b186c5547b9527a768a56f3d6b281.tar.xz musl-a11a6246c64b186c5547b9527a768a56f3d6b281.zip

overhaul wide character case mapping implementation

the existing implementation of case mappings was very small (typically
around 1.5k), but unmaintainable, requiring manual addition of new
case mappings with each new edition of Unicode. often, it turned out
that newly-added case mappings were not easily representable in the
existing tightly-constrained table structures, requiring new hacks to
be invented and delaying support for new characters.

the new implementation added here follows the pattern used for
character class membership, with a two-level table allowing Unicode
blocks for which no data is needed to be elided. however, rather than
single-bit data, each character maps to a one of up to 6 case-mapping
rules available to its block, where 6 is floor(cbrt(256)) and allow 3
characters to be represented per byte (vs 8 with bit tables). blocks
that would need more than 6 rules designate one as an exception and
let lookup pass into a binary search of exceptional cases for the
block.

the number 6 was chosen empirically; many blocks would be ok with 4
rules (uncased, lower, upper, possible exceptions), some even just
with 2, but the latter are rare and fitting 4 characters per byte
rather than 3 does not save significant space. moreover, somewhat
surprisingly, there are sufficiently many blocks where even 4 rules
don't suffice without a lot of exceptions (blocks where some case
pairs are laced, others offset) that originally I was looking at
supporting variable-width tables, with 1-, 2-, or 3-bit entries,
thereby allowing blocks with 8 rules. as implemented in my
experiments, that version was significantly larger and involved more
memory accesses/cache lines.

improvements in size at the expense of some performance might be
possible by utilizing iswalpha data or merging the table of case
mapping identity with alphabetic identity. these were explored
somewhat when the code was first written, and might be worth
revisiting in the future.

Diffstat (limited to 'src/thread/microblaze/__set_thread_area.s')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: