From: Lawrence D’Oliveiro
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Wed, 3 Sep 2025 04:10:47 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <1098f46$u638$2@dont-email.me>
References: <10974d1$jn0e$1@dont-email.me> <1097smc$qe34$5@dont-email.me> <10981js$rmq1$1@dont-email.me>

On Tue, 2 Sep 2025 18:20:09 -0600, Alex // nytpu wrote:

> (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
> hellish mess caused by extreme lack of foresight and it's horrible they
> saddled everyone, including people not using UTF-16, with this crap.

I gather the basic problem was that Unicode was originally going to be a
fixed-length 16-bit code, and that was that. And so early adopters
(Windows NT and Java among them), built UCS-2 right into their DNA.

Until Unicode 2.0, I believe it was, where they went “on second thought,
let’s go beyond our original brief and start including all kinds of
other things as well” ... and UCS-2 had to become UTF-16 ...

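For anyone following along, here is a rough sketch of the surrogate-pair
arithmetic that turned 16-bit UCS-2 into UTF-16. The function names are
made up for illustration; the constants (0xD800, 0xDC00, 0x10000) are the
standard UTF-16 ones. It also shows where a ceiling of 0x10FFFF comes
from: two surrogates carry 10 bits each, so only 2^20 extra code points
fit above the original 16-bit range.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points in U+10000..U+10FFFF use surrogates")
    offset = cp - 0x10000              # 20-bit offset into the supplementary range
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the largest encodable code point.
MAX_VIA_UTF16 = from_surrogates(0xDBFF, 0xDFFF)
```

Here `MAX_VIA_UTF16` works out to 0x10FFFF, and
`to_surrogates(0x1F600)` matches what Python's own "utf-16-be" codec
produces for that character.
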
> UTF-16 and its surrogate pairs is also what's responsible for the
> maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
> though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
> encoding the Chinese government came up with can all trivially encode
> full 32-bit values)

I wondered about that limit ...

> Plus conveniently Ada doesn't have routines for normalization, but can't
> hold that against it since neither does any other programming language
> because the lookup tables required are like 20 MiB even when optimized
> for space.

I think Python has them. But then, on platforms with decent package
management, that data can be shared with other installed packages that
require it as well.

> Plus you shouldn't normalize text other than performing actions like
> substring matching, equality tests, or sorting---and even if you
> normalize when performing those, *when possible* you should store the
> unnormalized original for display/output afterwards.

I thought it was always safe to store decomposed versions of everything.
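
A small sketch of the normalization point, assuming Python's standard
`unicodedata` module. The helper name `same_text` and the sample strings
are only illustrative. It normalizes for the comparison alone and leaves
the stored strings untouched, and it shows one case (U+212B ANGSTROM
SIGN) where decomposing and recomposing does not give back the original
code point, which may be why keeping the unnormalized text is advised.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings under NFC without modifying either of them."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Different code point sequences, same text once normalized.
raw_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)

# ANGSTROM SIGN decomposes canonically to "A" + COMBINING RING ABOVE;
# recomposing yields U+00C5, not the original U+212B.
angstrom = "\u212b"
round_trip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", angstrom))
survives_round_trip = round_trip == angstrom
```

In this sketch `raw_equal` is False, `normalized_equal` is True, and
`survives_round_trip` is False.
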