From: Alex // nytpu
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Wed, 3 Sep 2025 11:25:10 -0600
Organization: A noiseless patient Spider
Message-ID: <1099tlm$19g42$1@dont-email.me>
References: <10974d1$jn0e$1@dont-email.me> <1097smc$qe34$5@dont-email.me> <10981js$rmq1$1@dont-email.me> <1098f46$u638$2@dont-email.me>
In-Reply-To: <1098f46$u638$2@dont-email.me>

On 9/2/25 10:10 PM, Lawrence D’Oliveiro wrote:
> I gather the basic problem was that Unicode was originally going to be a
> fixed-length 16-bit code, and that was that. And so early adopters
> (Windows NT and Java among them), built UCS-2 right into their DNA.
>
> Until Unicode 2.0, I believe it was, where they went “on second thought,
> let’s go beyond our original brief and start including all kinds of other
> things as well” ... and UCS-2 had to become UTF-16 ...

Yeah, they started with UCS-2 as the only encoding because they thought
2^16 characters would be enough. A few years later they realized they'd
run out extremely quickly, even sticking solely to actively-used
languages and even with the very controversial Han unification, so they
had to hack together surrogate pairs to allow multiple planes. (UTF-8
was being developed around the same time for its desirable compatibility
with 7-bit ASCII, so the other encodings, which came later, are stuck
with UTF-16's limitations too.)

>> Plus conveniently Ada doesn't have routines for normalization, but can't
>> hold that against it since neither does any other programming language
>> because the lookup tables required are like 20 MiB even when optimized
>> for space.
>
> I think Python has them. But then, on
> platforms with decent package management, that data can be shared with
> other installed packages that require it as well.

Yeah, although Python is a language that's expected to have one global
runtime used by everything; anything that's compiled (with or without a
bundled runtime, e.g. Go) doesn't want to impose a mandatory 20 MiB
overhead on every executable for something that you can *usually* get
away with not using (see also the LUTs for Unicode character classes).

>> Plus you shouldn't normalize text other than performing actions like
>> substring matching, equality tests, or sorting---and even if you
>> normalize when performing those, *when possible* you should store the
>> unnormalized original for display/output afterwards.
>
> I thought it was always safe to store decomposed versions of everything.

Well, it depends; storing decomposed (NFD, NFKD) versions is acceptable
IIRC (or maybe not, because I think it still does some limited
substitution for "visually similar" characters, just less extreme), but
it's usually pointless if you don't need to inspect the contents.
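
To make the "normalize for comparison, keep the original for output"
idea concrete, here is a minimal Python sketch using the standard
unicodedata module. The helper names match_key and same_text are made up
for illustration, and NFC as the comparison form is just an assumption;
the point is only that the inputs themselves are never modified. The
last few lines show why storing a normalized copy can lose information:
even the canonical forms replace a few characters (the ANGSTROM SIGN
here), and the compatibility forms replace many more (the "fi" ligature).

```python
import unicodedata


def match_key(text: str, form: str = "NFC") -> str:
    """Return a normalized copy meant only for comparing or searching."""
    return unicodedata.normalize(form, text)


def same_text(a: str, b: str) -> bool:
    """Compare by normalized form; neither input string is changed."""
    return match_key(a) == match_key(b)


precomposed = "caf\u00e9"   # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # False: different code points
print(same_text(precomposed, decomposed))   # True: canonically equivalent

# Canonical normalization is not always reversible: ANGSTROM SIGN
# decomposes to 'A' + COMBINING RING ABOVE and recomposes to U+00C5,
# so the original U+212B cannot be recovered from either form.
angstrom = "\u212b"
print(unicodedata.normalize("NFD", angstrom) == "A\u030a")   # True
print(unicodedata.normalize("NFC", angstrom) == angstrom)    # False

# Compatibility normalization goes further and rewrites the ligature,
# which the canonical forms leave alone.
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFD", ligature) == ligature)    # True
print(unicodedata.normalize("NFKD", ligature))               # fi
```

For a search index, the same key function could be applied once at
indexing time and once to each query, with the untouched original kept
alongside the key for display.
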
And if you're storing something like a search index, then yeah, you
should also store normalized (NFC, NFKC) versions of the strings. But in
general, just keep the original form of things unless you need to
inspect or compare the contents (and if you don't need to inspect them
regularly, just convert when needed instead of storing the normalized
versions).

Just my opinion though; there are arguments either way. I just don't
like needlessly messing with the semantics of the input data.

~nytpu

-- 
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/