From: "Lawrence D’Oliveiro" <ldo@nz.invalid>
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Wed, 3 Sep 2025 04:10:47 -0000 (UTC)
Message-ID: <1098f46$u638$2@dont-email.me>
In-Reply-To: <10981js$rmq1$1@dont-email.me>
On Tue, 2 Sep 2025 18:20:09 -0600, Alex // nytpu wrote:
> (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
> hellish mess caused by extreme lack of foresight and it's horrible they
> saddled everyone, including people not using UTF-16, with this crap.
I gather the basic problem was that Unicode was originally going to be a
fixed-length 16-bit code, and that was that. And so early adopters
(Windows NT and Java among them) built UCS-2 right into their DNA.
Then came Unicode 2.0, I believe it was, where they went “on second
thought, let’s go beyond our original brief and start including all kinds
of other things as well” ... and UCS-2 had to become UTF-16 ...
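For anyone who has not looked at what that transition involves, here is a
minimal Python sketch of the surrogate-pair scheme UTF-16 uses for code
points that do not fit in 16 bits. The function name to_utf16_units is
made up for illustration; the arithmetic is the standard UTF-16 one.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one scalar value."""
    # Surrogate code points and anything above U+10FFFF are not scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Fits in one unit: identical to what UCS-2 would have stored.
        return [code_point]
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]


units = to_utf16_units(0x1F600)
print([hex(u) for u in units])              # ['0xd83d', '0xde00']

# Cross-check against Python's own big-endian UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDE, 0x00])
```

Code that still assumes UCS-2 sees those two units as two separate
"characters", which is where a lot of the pain comes from.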
> UTF-16 and its surrogate pairs is also what's responsible for the
> maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
> though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
> encoding the Chinese government came up with can all trivially encode
> full 32-bit values)
I wondered about that limit ...
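The arithmetic behind the limit can be checked quickly. This is just a
sketch of the counting argument, with constant names of my own choosing:
1024 high surrogates times 1024 low surrogates, on top of the original
16-bit range.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)     # 1024 values
LOW_SURROGATES = range(0xDC00, 0xE000)      # 1024 values
BMP_SIZE = 0x10000                          # the original 16-bit range

# Every (high, low) pair addresses one code point beyond the 16-bit range.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)
max_code_point = BMP_SIZE + supplementary - 1

print(hex(supplementary))                   # 0x100000
print(hex(max_code_point))                  # 0x10ffff
```

So the ceiling is exactly what two 10-bit halves can reach, and Python's
chr() enforces the same bound: chr(0x10FFFF) works, chr(0x110000) raises
ValueError.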
> Plus conveniently Ada doesn't have routines for normalization, but can't
> hold that against it since neither does any other programming language
> because the lookup tables required are like 20 MiB even when optimized
> for space.
I think Python has them: unicodedata.normalize() in the standard library
<https://docs.python.org/3/library/unicodedata.html>. But then, on
platforms with decent package management, that data can be shared with
other installed packages that require it as well.
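A short sketch of what that module offers, using only calls documented on
that page. The sample strings are my own.

```python
import unicodedata

composed = "\u00e9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two spellings render the same but compare unequal as raw strings.
print(composed == decomposed)                                  # False

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)                                         # True
print(nfd == decomposed)                                       # True

# Version of the Unicode Character Database bundled with this Python.
print(unicodedata.unidata_version)
```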
> Plus you shouldn't normalize text other than performing actions like
> substring matching, equality tests, or sorting---and even if you
> normalize when performing those, *when possible* you should store the
> unnormalized original for display/output afterwards.
I thought it was always safe to store decomposed versions of everything.
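One way to test that assumption, again with Python's unicodedata: a
sketch, not a verdict. Decomposing keeps the text canonically equivalent,
but for some characters it does not seem possible to get the original code
points back afterwards. ANGSTROM SIGN (U+212B) is the example I know of.

```python
import unicodedata

original = "\u212b"                          # ANGSTROM SIGN
stored = unicodedata.normalize("NFD", original)
restored = unicodedata.normalize("NFC", stored)

print([hex(ord(c)) for c in stored])         # ['0x41', '0x30a']
print([hex(ord(c)) for c in restored])       # ['0xc5']
print(restored == original)                  # False
```

So storing decomposed text is fine for matching and sorting, but if the
exact original code points matter for output, this looks like a case for
keeping the unnormalized string as well.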