From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 4.0.1 (2024-03-25) on
	ip-172-31-91-241.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=3.0 tests=none autolearn=ham
	autolearn_force=no version=4.0.1
Path: nntp.eternal-september.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid>
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings
 handling)
Date: Wed, 3 Sep 2025 04:10:47 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <1098f46$u638$2@dont-email.me>
References: <op.vhrad6mjule2fv@garhos>
	<a8156fc2-bfbd-8199-b440-0ca9192d6936@insomnia247.nl>
	<10974d1$jn0e$1@dont-email.me> <1097smc$qe34$5@dont-email.me>
	<10981js$rmq1$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 03 Sep 2025 04:10:47 +0000 (UTC)
Injection-Info: dont-email.me; posting-host="6c0c8d65eefb29ec66a72ee525b380bb";
	logging-data="989288"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/W6erLvGylXToOl4FYcz1u"
User-Agent: Pan/0.163 (Kryvyi Rih)
Cancel-Lock: sha1:YwH9TlG7qPG9NvueTuRgt3BocL0=
Xref: feeder.eternal-september.org comp.lang.ada:67027 fr.comp.lang.ada:2363
List-Id: <comp.lang.ada>

On Tue, 2 Sep 2025 18:20:09 -0600, Alex // nytpu wrote:

> (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
> hellish mess caused by extreme lack of foresight and it's horrible they
> saddled everyone, including people not using UTF-16, with this crap.

I gather the basic problem was that Unicode was originally going to be a 
fixed-length 16-bit code, and that was that. And so early adopters 
(Windows NT and Java among them) built UCS-2 right into their DNA.

That held until Unicode 2.0, I believe it was, when they went “on second
thought, let’s go beyond our original brief and start including all kinds
of other things as well” ... and UCS-2 had to become UTF-16 ...

> UTF-16 and its surrogate pairs is also what's responsible for the
> maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
> though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
> encoding the Chinese government came up with can all trivially encode
> full 32-bit values)

I wondered about that limit ...
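
For what it's worth, the arithmetic behind that limit can be sketched in a
few lines of Python. The surrogate ranges are the standard UTF-16 ones; the
function names to_surrogates and from_surrogates are just my own
illustration, not any library's API.

```python
# Sketch: why UTF-16 surrogate pairs cap code points at 0x10FFFF.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # 1024 high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # 1024 low (trail) surrogates


def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                # fits in 20 bits
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it encodes."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1024 * 1024 pairs sit on top of the original 16-bit range.
pairs = (HIGH_END - HIGH_START + 1) * (LOW_END - LOW_START + 1)
max_code_point = 0x10000 + pairs - 1
print(hex(max_code_point))  # 0x10ffff
```

So the ceiling is simply the 16-bit range plus whatever 1024 x 1024
surrogate pairs can address, and nothing beyond that is reachable from
UTF-16.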

> Plus conveniently Ada doesn't have routines for normalization, but can't
> hold that against it since neither does any other programming language
> because the lookup tables required are like 20 MiB even when optimized
> for space.

I think Python has them, as unicodedata.normalize() in the standard library
<https://docs.python.org/3/library/unicodedata.html>. But then, on 
platforms with decent package management, that data can be shared with 
other installed packages that require it as well.
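
A minimal sketch of what that looks like in practice, using only the
standard unicodedata module. The sample strings are my own choice.

```python
import unicodedata

composed = "\u00e9"       # e-acute as one precomposed code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two spellings look identical but compare unequal as raw strings.
raw_equal = composed == decomposed

# NFC composes where it can; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(raw_equal, nfc == composed, nfd == decomposed)
# Version of the Unicode database bundled with this interpreter.
print(unicodedata.unidata_version)
```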

> Plus you shouldn't normalize text other than performing actions like
> substring matching, equality tests, or sorting---and even if you
> normalize when performing those, *when possible* you should store the
> unnormalized original for display/output afterwards.

I thought it was always safe to store decomposed versions of everything.
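
Checking that against Python's unicodedata, though: decomposed text stays
canonically equivalent, but the original code points are not always
recoverable. A small sketch with ANGSTROM SIGN as the test case (the helper
same_text is hypothetical, just for illustration):

```python
import unicodedata

angstrom = "\u212b"                               # ANGSTROM SIGN
stored = unicodedata.normalize("NFD", angstrom)   # 'A' + COMBINING RING ABOVE
restored = unicodedata.normalize("NFC", stored)   # U+00C5, not U+212B


def same_text(a, b):
    """Compare two strings under canonical equivalence (via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(restored == angstrom)           # False: the original code point is gone
print(same_text(restored, angstrom))  # True: still canonically equivalent
```

So it is safe for matching and comparison, but if the exact original bytes
matter for output, that would seem to be the case for keeping the
unnormalized text around.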