From: Alex // nytpu
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Tue, 2 Sep 2025 18:20:09 -0600
Message-ID: <10981js$rmq1$1@dont-email.me>

On 9/2/25 4:56 PM, Lawrence D'Oliveiro wrote:
> On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
>> ... (UCS-4 has a number of additional differences from UTF-32
>> regarding "valid encodings", namely that all valid Unicode
>> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
>> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
>> are valid in UTF-32) ...
>
> So what do those codes mean in UCS-4?

Unfortunately, here's where it gets more complex. There's a difference
between a *valid* codepoint/scalar value and an *assigned* scalar
value. The vast majority of valid scalar values are unassigned
(currently 154,998 characters are standardized out of 1,114,112
possible), but everything other than text renderers and normalizers
should handle unassigned values like any other character, to allow at
least some forwards compatibility when new characters are added.

So a UCS-4 (or any UCS-*) implementation just treats the surrogate
codepoints like any other unassigned codepoints (ones that happen to
never get assigned, not that the implementation would know), while in
UTF-32 they are completely invalid and must not be represented at all.
An implementation should either error out or replace them with the
replacement character U+FFFD, to ensure it is always working with
valid UTF-32. (This is what makes the Windows character set and Ada's
Wide_String messy: both were standardized before UTF-16 existed, so
for backwards compatibility they still permit unpaired surrogates, and
you have to sanitize the text yourself or your UTF-8 encoder, or
whatever other software reads your text, will declare the encoding
invalid.)

(This whole mess exists because UCS-2 and the UCS-2-to-UTF-16
transition were a hellish mess caused by an extreme lack of foresight,
and it's horrible that they saddled everyone with it, including people
not using UTF-16. UTF-16 and its surrogate pairs are also what's
responsible for the maximum scalar value being 0x0010_FFFF instead of
0xFFFF_FFFF, even though UTF-32, UTF-8, even goddamn UTF-EBCDIC, and
the weird encoding the Chinese government came up with (GB18030) can
all trivially encode far larger values.)
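To make the sanitizing concrete, here's a minimal sketch (my own
illustration, not anything out of the Ada standard library) that maps
a Wide_Wide_String to one guaranteed to contain only scalar values:

   --  Sketch only: replace anything that isn't a Unicode scalar
   --  value with U+FFFD.
   function Sanitize (Input : Wide_Wide_String) return Wide_Wide_String
   is
      Result : Wide_Wide_String := Input;
   begin
      for C of Result loop
         declare
            Code : constant Natural := Wide_Wide_Character'Pos (C);
         begin
            --  Surrogates (and anything past 16#10_FFFF#) are
            --  representable in UCS-4 and in Wide_Wide_Character,
            --  but they aren't scalar values, so they're invalid
            --  in UTF-32.
            if Code in 16#D800# .. 16#DFFF# or else Code > 16#10_FFFF#
            then
               C := Wide_Wide_Character'Val (16#FFFD#);
            end if;
         end;
      end loop;
      return Result;
   end Sanitize;

The erroring variant would raise an exception instead of substituting;
which one you want depends on whether silently losing the bogus code
units is acceptable for your application.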
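And for reference, the surrogate-pair arithmetic itself (again just a
sketch, with made-up names): a scalar value above 16#FFFF# is offset
by 16#1_0000# and split into two ten-bit halves, each carried in one
reserved 16-bit code unit. Twenty payload bits on top of 16#1_0000#
is exactly where the 16#10_FFFF# ceiling comes from.

   --  Illustration only; names are made up.
   type UTF_16_Code_Unit is mod 2**16;
   type Surrogate_Pair is record
      High, Low : UTF_16_Code_Unit;
   end record;

   --  Only meaningful for Code in 16#1_0000# .. 16#10_FFFF#.
   function Encode_Supplementary (Code : Natural) return Surrogate_Pair
   is
      Offset : constant Natural := Code - 16#1_0000#;  --  20 bits
   begin
      return (High => 16#D800# + UTF_16_Code_Unit (Offset / 2**10),
              Low  => 16#DC00# + UTF_16_Code_Unit (Offset mod 2**10));
   end Encode_Supplementary;

E.g. U+1F600 comes out as High = 16#D83D#, Low = 16#DE00#, which is
also why those code units may never appear on their own in well-formed
UTF-16.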
> This is why you have "normalization".

You still can't just arbitrarily split strings without being careful:
there are characters that are inherently multi-codepoint (most emoji,
among others) and that, unlike some sequences, can't be reduced to a
single codepoint. Really, with Unicode you just shouldn't treat text
as an array of any fixed-size quantity; with multi-codepoint graphemes
and combining characters and such, it's simply not possible.

Plus, Ada doesn't have routines for normalization, though you can't
hold that against it, since neither does nearly any other programming
language: the lookup tables required are around 20 MiB even when
optimized for space. (Everyone says to just link against libicu, which
also gets you out of needing to keep your program's Unicode tables up
to date when a new Unicode version releases.)

Also, you shouldn't normalize text except when performing operations
like substring matching, equality tests, or sorting---and even when
you normalize for those, *when possible* you should keep the
unnormalized original around for display/output afterwards.
Normalization loses a lot of semantic information, because many
distinct characters are mapped onto one: compatibility normalization
maps non-breaking spaces to a plain space, mathematical font variants
and superscripts to the plain Latin/Greek letters, many different
languages' characters onto a single one when they happen to be
visually similar, etc. etc.

~nytpu

--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/