From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 4.0.1 (2024-03-25) on
	ip-172-31-91-241.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=0.0 required=3.0 tests=none autolearn=ham
	autolearn_force=no version=4.0.1
Path: nntp.eternal-september.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: Alex // nytpu <nytpu@example.invalid>
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings
 handling)
Date: Wed, 3 Sep 2025 11:25:10 -0600
Organization: A noiseless patient Spider
Message-ID: <1099tlm$19g42$1@dont-email.me>
References: <op.vhrad6mjule2fv@garhos>
 <a8156fc2-bfbd-8199-b440-0ca9192d6936@insomnia247.nl>
 <10974d1$jn0e$1@dont-email.me> <1097smc$qe34$5@dont-email.me>
 <10981js$rmq1$1@dont-email.me> <1098f46$u638$2@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 03 Sep 2025 17:25:11 +0000 (UTC)
Injection-Info: dont-email.me; posting-host="9dccf2af25a5c70fac7f16ae550e4f98";
	logging-data="1360002"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/UZtoyPy9P8lU+KKD3prdg"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+l+TEtdeMmdzK1LYM+SkGqHkFB8=
Content-Language: en-US, en-US-large
In-Reply-To: <1098f46$u638$2@dont-email.me>
Xref: feeder.eternal-september.org comp.lang.ada:67028 fr.comp.lang.ada:2364
List-Id: <comp.lang.ada>

On 9/2/25 10:10 PM, Lawrence D’Oliveiro wrote:
> I gather the basic problem was that Unicode was originally going to be a
> fixed-length 16-bit code, and that was that. And so early adopters
> (Windows NT and Java among them), built UCS-2 right into their DNA.
> 
> Until Unicode 2.0, I believe it was, where they went “on second thought,
> let’s go beyond our original brief and start including all kinds of other
> things as well” ... and UCS-2 had to become UTF-16 ...
Yeah, they started with UCS-2 (as the only encoding) because they 
thought that 2^16 characters would be enough, but a few years later 
they realized they'd run out extremely quickly, even sticking solely to 
actively-used languages and even with the very controversial Han 
unification. So they had to hack together surrogate pairs to reach 
multiple planes, and since UTF-16 can only address up to U+10FFFF that 
way, the other encodings (including UTF-8, which was designed for its 
desirable compatibility with 7-bit ASCII) are stuck with the same limit.
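
For anyone who hasn't looked at how the surrogate hack works, here is a 
minimal Python sketch of the arithmetic. The function names are made up 
for illustration and are not from any library; the constants are the 
standard UTF-16 surrogate ranges.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF is encoded with surrogates")
    offset = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)    # U+1F600 GRINNING FACE
print(hex(high), hex(low))            # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))  # 0x1f600
```

Two 10-bit halves give 2^20 extra code points on top of the BMP, which 
is exactly why the code space stops at U+10FFFF.
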
>> Plus conveniently Ada doesn't have routines for normalization, but can't
>> hold that against it since neither does any other programming language
>> because the lookup tables required are like 20 MiB even when optimized
>> for space.
> 
> I think Python has them
> <https://docs.python.org/3/library/unicodedata.html>. But then, on
> platforms with decent package management, that data can be shared with
> other installed packages that require it as well.
Yeah, although Python is a language that's expected to have one global 
runtime shared by everything; anything that's compiled (with or without a 
bundled runtime, e.g. Go) doesn't want to impose a mandatory 20 MiB 
overhead on every executable for something that you can *usually* get 
away with not using (see also the LUTs for Unicode character classes).
>> Plus you shouldn't normalize text other than performing actions like
>> substring matching, equality tests, or sorting---and even if you
>> normalize when performing those, *when possible* you should store the
>> unnormalized original for display/output afterwards.
> 
> I thought it was always safe to store decomposed versions of everything.
Well, it depends. Storing decomposed versions is acceptable IIRC, at 
least for NFD; NFKD is lossy because it replaces compatibility 
characters. (Even NFD may not be entirely safe, because I think it still 
does some limited substitution of characters that are considered 
equivalent, just far less extreme.) It's usually pointless if you don't 
need to inspect the contents, though. And if you're storing something 
like a search index, then yeah, you should store normalized (NFC, NFKC) 
versions of the strings. But in general, just keep the original form of 
things unless you need to inspect/compare the contents (and if you don't 
need to inspect the contents regularly, then just convert when needed 
instead of storing the normalized versions).
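
A rough Python sketch of the "normalize only to compare, keep the 
original" approach, using the standard `unicodedata` module linked 
above. The helper name `matches` is hypothetical, and NFC as the default 
form is just an assumption for the example.

```python
import unicodedata


def matches(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings by normalized form without altering either one."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


stored = "cafe\u0301"   # original input, kept as-is: 'e' + COMBINING ACUTE ACCENT
query = "caf\u00e9"     # same word with precomposed U+00E9

print(stored == query)         # False
print(matches(stored, query))  # True

# Even the canonical forms can replace a character outright:
angstrom = "\u212b"     # ANGSTROM SIGN
print(unicodedata.normalize("NFC", angstrom) == "\u00c5")   # True
print(unicodedata.normalize("NFD", angstrom) == "A\u030a")  # True

# The compatibility forms go further and lose distinctions such as ligatures:
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")      # True
```

The stored string is never rewritten here; normalization happens only 
inside the comparison.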

Just my opinion though; there are arguments either way. I just don't like 
needlessly messing with the semantics of the input data.

~nytpu

-- 
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/