From: Alex // nytpu
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Tue, 2 Sep 2025 18:20:09 -0600
Message-ID: <10981js$rmq1$1@dont-email.me>

On 9/2/25 4:56 PM, Lawrence D'Oliveiro wrote:
> On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
>> ... (UCS-4 has a number of additional differences from UTF-32
>> regarding "valid encodings", namely that all valid Unicode
>> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
>> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
>> are valid in UTF-32) ...
>
> So what do those codes mean in UCS-4?

Unfortunately, here's where it gets more complex. There's a difference
between a *valid* codepoint/scalar value and an *assigned* scalar
value. The vast majority of valid scalar values are unassigned
(currently 154,998 characters are standardized out of 1,114,112
possible), but everything other than text renderers and normalizers
should handle unassigned values like any other character, to allow at
least some forwards compatibility when new characters are added.

So a UCS-4 (or any UCS-*) implementation just treats the surrogate
codepoints like any other unassigned codepoints (ones that happen to
never get assigned, not that the implementation would know), while in
UTF-32 they are completely invalid and must not be represented at all.
An implementation should either error out or replace them with the
replacement character U+FFFD, to ensure it is always working with
valid UTF-32. (This is what makes the Windows character set and Ada's
Wide_String messy: both were standardized before UTF-16 existed, so
for backwards compatibility they still permit unpaired surrogates, and
you have to sanitize the text yourself or your UTF-8 encoder, or
whatever other software reads your text, will declare the encoding
invalid.)

(This whole mess exists because UCS-2 and the UCS-2-to-UTF-16
transition were a hellish mess caused by an extreme lack of foresight,
and it's horrible that they saddled everyone with it, including people
not using UTF-16. UTF-16 and its surrogate pairs are also what's
responsible for the maximum scalar value being 0x0010_FFFF instead of
0xFFFF_FFFF, even though UTF-32, UTF-8, even goddamn UTF-EBCDIC, and
the weird encoding the Chinese government came up with (GB18030) can
all trivially encode far larger values.)
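To make the sanitizing concrete, here's a minimal sketch (my own
illustration, not anything out of the Ada standard library) that maps
a Wide_Wide_String to one guaranteed to contain only scalar values:

   --  Sketch only: replace anything that isn't a Unicode scalar
   --  value with U+FFFD.
   function Sanitize (Input : Wide_Wide_String) return Wide_Wide_String
   is
      Result : Wide_Wide_String := Input;
   begin
      for C of Result loop
         declare
            Code : constant Natural := Wide_Wide_Character'Pos (C);
         begin
            --  Surrogates (and anything past 16#10_FFFF#) are
            --  representable in UCS-4 and in Wide_Wide_Character,
            --  but they aren't scalar values, so they're invalid
            --  in UTF-32.
            if Code in 16#D800# .. 16#DFFF# or else Code > 16#10_FFFF#
            then
               C := Wide_Wide_Character'Val (16#FFFD#);
            end if;
         end;
      end loop;
      return Result;
   end Sanitize;

The erroring variant would raise an exception instead of substituting;
which one you want depends on whether silently losing the bogus code
units is acceptable for your application.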
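And for reference, the surrogate-pair arithmetic itself (again just a
sketch, with made-up names): a scalar value above 16#FFFF# is offset
by 16#1_0000# and split into two ten-bit halves, each carried in one
reserved 16-bit code unit. Twenty payload bits on top of 16#1_0000#
is exactly where the 16#10_FFFF# ceiling comes from.

   --  Illustration only; names are made up.
   type UTF_16_Code_Unit is mod 2**16;
   type Surrogate_Pair is record
      High, Low : UTF_16_Code_Unit;
   end record;

   --  Only meaningful for Code in 16#1_0000# .. 16#10_FFFF#.
   function Encode_Supplementary (Code : Natural) return Surrogate_Pair
   is
      Offset : constant Natural := Code - 16#1_0000#;  --  20 bits
   begin
      return (High => 16#D800# + UTF_16_Code_Unit (Offset / 2**10),
              Low  => 16#DC00# + UTF_16_Code_Unit (Offset mod 2**10));
   end Encode_Supplementary;

E.g. U+1F600 comes out as High = 16#D83D#, Low = 16#DE00#, which is
also why those code units may never appear on their own in well-formed
UTF-16.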
> This is why you have "normalization".

You still can't just arbitrarily split strings without being careful:
there are characters that are inherently multi-codepoint (most emoji,
among others) and that, unlike some sequences, can't be reduced to a
single codepoint. Really, with Unicode you just shouldn't treat text
as an array of any fixed-size quantity; with multi-codepoint graphemes
and combining characters and such, it's simply not possible.

Plus, Ada doesn't have routines for normalization, though you can't
hold that against it, since neither does nearly any other programming
language: the lookup tables required are around 20 MiB even when
optimized for space. (Everyone says to just link against libicu, which
also gets you out of needing to keep your program's Unicode tables up
to date when a new Unicode version releases.)

Also, you shouldn't normalize text except when performing operations
like substring matching, equality tests, or sorting---and even when
you normalize for those, *when possible* you should keep the
unnormalized original around for display/output afterwards.
Normalization loses a lot of semantic information, because many
distinct characters are mapped onto one: compatibility normalization
maps non-breaking spaces to a plain space, mathematical font variants
and superscripts to the plain Latin/Greek letters, many different
languages' characters onto a single one when they happen to be
visually similar, etc. etc.

~nytpu

--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/