comp.lang.ada
From: Alex // nytpu <nytpu@example.invalid>
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Wed, 3 Sep 2025 11:25:10 -0600
Message-ID: <1099tlm$19g42$1@dont-email.me>
In-Reply-To: <1098f46$u638$2@dont-email.me>

On 9/2/25 10:10 PM, Lawrence D’Oliveiro wrote:
> I gather the basic problem was that Unicode was originally going to be a
> fixed-length 16-bit code, and that was that. And so early adopters
> (Windows NT and Java among them), built UCS-2 right into their DNA.
> 
> Until Unicode 2.0, I believe it was, where they went “on second thought,
> let’s go beyond our original brief and start including all kinds of other
> things as well” ... and UCS-2 had to become UTF-16 ...
Yeah, they started with UCS-2 (as the only encoding) because they
thought that 2^16 characters would be enough.  A few years later they
realized they'd run out extremely quickly, even sticking solely with
actively-used languages and even with the very controversial Han
unification, so they had to hack together surrogate pairs to allow
multiple planes.  UTF-8 was being developed around the same time for its
desirable compatibility with 7-bit ASCII, but UCS-2 was already
entrenched by then, so every encoding has had to live with UTF-16's
limitations.
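
To make the surrogate-pair hack concrete, here is a minimal Python sketch of how a code point outside the first plane gets split into two 16-bit units. The function name `to_surrogate_pair` is just something I made up for illustration; the arithmetic is the standard UTF-16 scheme, and the last lines cross-check it against Python's own encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)

# Cross-check against the built-in encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The 20-bit offset is why the code space tops out at U+10FFFF, and the D800..DFFF range had to be reserved in every encoding, not just UTF-16.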
>> Plus conveniently Ada doesn't have routines for normalization, but can't
>> hold that against it since neither does any other programming language
>> because the lookup tables required are like 20 MiB even when optimized
>> for space.
> 
> I think Python has them
> <https://docs.python.org/3/library/unicodedata.html>. But then, on
> platforms with decent package management, that data can be shared with
> other installed packages that require it as well.
Yeah, although it's a language that is expected to have one global 
runtime used by everything; anything that's compiled (with or without a 
bundled runtime, e.g. Go) doesn't want to impose a mandatory 20 MiB 
overhead in every executable for something that you can *usually* get 
away with not using (see also the LUTs for Unicode character classes).
>> Plus you shouldn't normalize text other than performing actions like
>> substring matching, equality tests, or sorting---and even if you
>> normalize when performing those, *when possible* you should store the
>> unnormalized original for display/output afterwards.
> 
> I thought it was always safe to store decomposed versions of everything.
Well, it depends; storing decomposed (NFD, NFKD) versions is acceptable
IIRC (maybe not, because I think even decomposition still does some
limited substitution for "visually similar" characters, just less
extreme), but it's usually pointless if you don't need to inspect the
contents.  If you're storing something like a search index, then yeah,
you should also store normalized (NFC, NFKC) versions of the strings.
But in general just keep the original form of things unless you need to
inspect/compare the contents (and if you don't need to inspect the
contents regularly, then just convert when needed instead of storing
the normalized versions).
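
As a rough illustration of "normalize to compare, keep the original to display", here is a small Python sketch using the standard `unicodedata` module. The helper names `comparison_key` and `same_text` are hypothetical, and the choice of NFC for the key is only an assumption; any one form works as long as it is used consistently.

```python
import unicodedata


def comparison_key(text):
    """Return a normalized key for matching; not meant for display."""
    return unicodedata.normalize("NFC", text)


def same_text(a, b):
    """Compare two strings by canonical equivalence."""
    return comparison_key(a) == comparison_key(b)


composed = "caf\u00e9"       # precomposed e-acute
decomposed = "cafe\u0301"    # plain e followed by a combining acute accent

assert composed != decomposed            # raw comparison says "different"
assert same_text(composed, decomposed)   # normalized comparison says "same"

# Even canonical normalization rewrites some characters for good:
# ANGSTROM SIGN (U+212B) does not survive NFD or NFC.
angstrom = "\u212b"
assert unicodedata.normalize("NFD", angstrom) == "A\u030a"
assert unicodedata.normalize("NFC", angstrom) == "\u00c5"
```

The Angstrom case is the sort of substitution I had in mind: once it is normalized, the original code point can't be recovered, which is why I'd rather store the input untouched and only derive the key when needed.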

Just my opinion though; there are arguments either way, I just don't
like needlessly messing with the semantics of the input data.

~nytpu

-- 
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/


Thread overview: 27+ messages
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
2010-08-21  6:21 ` Dmitry A. Kazakov
2010-08-21  7:01 ` J-P. Rosen
2010-08-21  8:12   ` Yannick Duchêne (Hibou57)
2010-08-22 18:51     ` J-P. Rosen
2010-08-22 19:48       ` Georg Bauhaus
2010-08-22 20:40         ` J-P. Rosen
2010-08-23 10:32           ` Georg Bauhaus
2010-08-23 22:28 ` Randy Brukardt
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
2025-08-31 21:23   ` Kevin Chadwick
2025-08-31 21:27     ` Nicolas Paul Colin de Glocester
2025-09-02 16:01   ` Alex // nytpu
2025-09-02 17:40     ` Nicolas Paul Colin de Glocester
2025-09-02 18:49       ` Keith Thompson
2025-09-02 19:27         ` Nicolas Paul Colin de Glocester
2025-09-02 20:02           ` Keith Thompson
2025-09-02 17:42     ` Nicolas Paul Colin de Glocester
2025-09-02 19:15       ` Alex // nytpu
2025-09-02 19:50         ` Nicolas Paul Colin de Glocester
2025-09-02 18:08     ` Dmitry A. Kazakov
2025-09-02 19:13       ` Alex // nytpu
2025-09-02 22:56     ` Lawrence D’Oliveiro
2025-09-03  0:20       ` Alex // nytpu
2025-09-03  4:10         ` Lawrence D’Oliveiro
2025-09-03 17:25           ` Alex // nytpu [this message]