From: Alex // nytpu <nytpu@example.invalid>
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Tue, 2 Sep 2025 18:20:09 -0600
Message-ID: <10981js$rmq1$1@dont-email.me>
In-Reply-To: <1097smc$qe34$5@dont-email.me>
On 9/2/25 4:56 PM, Lawrence D’Oliveiro wrote:
> On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
>> ... (UCS-4 has a number of additional differences from UTF-32
>> regarding "valid encodings", namely that all valid Unicode
>> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
>> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
>> are valid in UTF-32) ...
>
> So what do those codes mean in UCS-4?
Unfortunately, here's where you get more complexity. So there's a
difference between a valid codepoint/scalar value and an assigned scalar
value. The vast majority of valid scalar values are unassigned (as of
Unicode 16.0, 154,998 characters are standardized out of 1,114,112
possible codepoints), but everything other than text renderers and normalizers
should handle them like any other character to allow for at least some
level of forwards compatibility when new characters are added.
So in a UCS-4 (or any UCS-n) implementation, the surrogate codepoints
are just treated like unassigned codepoints (that will never be
assigned, not that the implementation would know), while in UTF-32 they
are completely invalid and should not be represented at all. A UTF-32
implementation should either error out or replace them with the
substitution character U+FFFD, to ensure that it's always working with
valid UTF-32. (This is what makes the Windows character set and Ada's
Wide_Strings messy: they were originally standardized before UTF-16, so
to keep backwards compatibility they still support unpaired surrogates,
and you have to sanitize the text yourself to avoid making your UTF-8
encoder or the other software reading your text declare the encoding
invalid.)
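To make the sanitizing step concrete, here's a rough sketch in Python (just because it's quick to try out; nothing here is Ada's or Windows' API, and the names `is_scalar_value` and `sanitize_utf16` are made up for illustration). It assumes the input is a list of 16-bit code units and takes the "replace with U+FFFD" option rather than erroring out:

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_scalar_value(cp: int) -> bool:
    """True for Unicode scalar values: all codepoints except surrogates."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def sanitize_utf16(units: list[int]) -> list[int]:
    """Decode 16-bit code units to scalar values.

    Well-formed surrogate pairs are combined; unpaired surrogates are
    replaced with U+FFFD so the result is always valid UTF-32.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one supplementary codepoint.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone surrogate (high without low, or a stray low): not a scalar value.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out
```

Every value in the returned list passes `is_scalar_value`, so it can be handed to a UTF-8 encoder without tripping its validity checks.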
(This whole mess exists because UCS-2 and the UCS-2->UTF-16 transition
were a hellish mess caused by extreme lack of foresight, and it's
horrible they saddled everyone, including people not using UTF-16, with
this crap. UTF-16 and its surrogate pairs are also what's responsible
for the maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF,
even though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
encoding the Chinese government came up with could all trivially encode
far larger values.)
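The 0x10FFFF ceiling falls straight out of the surrogate arithmetic. A small sketch (Python again; `encode_pair` is just an illustrative name): there are 1,024 high surrogates and 1,024 low surrogates, so pairs can address 1,024 * 1,024 codepoints beyond the 16-bit range and not one more.

```python
HIGH = range(0xD800, 0xDC00)  # 1,024 high (lead) surrogates
LOW = range(0xDC00, 0xE000)   # 1,024 low (trail) surrogates

# Each (high, low) combination names one codepoint above U+FFFF.
pairs = len(HIGH) * len(LOW)           # 1,048,576
max_code_point = 0x10000 + pairs - 1   # pairs start right after U+FFFF


def encode_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane codepoint into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane codepoint")
    v = cp - 0x10000  # 20-bit offset: top 10 bits go high, bottom 10 go low
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


print(hex(max_code_point))
print([hex(u) for u in encode_pair(0x10FFFF)])
```

The last pair available, (0xDBFF, 0xDFFF), is exactly U+10FFFF, which is why every other encoding got capped there too.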
> This is why you have “normalization”.
> <https://www.unicode.org/faq/char_combmark.html>
You still can't just arbitrarily split strings without being careful:
there are characters that are inherently multi-codepoint (e.g. most
emoji, among others) and can't be reduced to a single codepoint the way
some can. Unfortunately, with Unicode you really just shouldn't treat
text as an "array" of any fixed-size quantity that you can index or
split freely, because with multi-codepoint graphemes, combining
characters and such, no fixed-size unit corresponds to one
user-perceived character.
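A quick illustration, using Python's standard `unicodedata` module purely as a convenient way to poke at this (the variable names are mine). NFC can fold "e" plus a combining acute accent into one codepoint, but an emoji ZWJ sequence has no precomposed form, so it stays at five codepoints and a naive split cuts it apart:

```python
import unicodedata

e_acute = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl

# Normalization helps here: a precomposed U+00E9 exists.
composed = unicodedata.normalize("NFC", e_acute)
print(len(e_acute), len(composed))

# It does not help here: there is no single codepoint for the family emoji.
family_nfc = unicodedata.normalize("NFC", family)
print(len(family_nfc), family_nfc == family)

# Splitting "in the middle" by codepoint index leaves a man emoji
# followed by a dangling ZERO WIDTH JOINER.
naive_half = family[: len(family) // 2]
print([f"U+{ord(c):04X}" for c in naive_half])
```

Doing this properly means segmenting on grapheme cluster boundaries, which again needs the Unicode data tables (libicu or similar) rather than any fixed-width indexing.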
Plus conveniently Ada doesn't have routines for normalization, but I
can't hold that against it since many other programming languages don't
ship them either, because the lookup tables required are like 20 MiB
even when optimized for space. (Everyone says to just link to libicu,
which also gets you out of needing to keep your program's Unicode tables
up-to-date when a new Unicode version releases.)
Plus you shouldn't normalize text except when performing actions like
substring matching, equality tests, or sorting; and even if you
normalize when performing those, *when possible* you should store the
unnormalized original for display/output afterwards. Normalization
(the compatibility forms NFKC/NFKD in particular) causes lots of
semantic information loss because many distinct characters are mapped
onto one (e.g. non-breaking and fixed-width spaces are mapped to plain
space, mathematical font variants and superscripts are mapped to the
plain Latin/Greek versions, many different languages' characters are
mapped to one if the characters happen to be visually similar, etc.
etc.).
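Roughly what I mean by "normalize for matching, keep the original for output", sketched with Python's standard `unicodedata` module (`match_key` and `search` are hypothetical helpers, and NFKC plus case folding is just one possible choice of key):

```python
import unicodedata


def match_key(s: str) -> str:
    """Key used only for comparison: compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", s).casefold()


def search(needle: str, stored: list[str]) -> list[str]:
    """Match on normalized keys, but return the untouched originals."""
    key = match_key(needle)
    return [s for s in stored if key in match_key(s)]


# 'x', SUPERSCRIPT TWO, NO-BREAK SPACE, CIRCLED DIGIT ONE
original = "x\u00B2\u00A0\u2460"

nfc = unicodedata.normalize("NFC", original)    # canonical form keeps all three
nfkc = unicodedata.normalize("NFKC", original)  # compatibility form flattens them

print(nfc == original)
print(repr(nfkc))
print(search("X2", ["x\u00B2 + 1", "y"]))
```

The NFKC result can no longer tell "x squared" from "x2", which is fine for a search key and terrible for the text you hand back to the user.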
~nytpu
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/