From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 4.0.1 (2024-03-25) on ip-172-31-91-241.ec2.internal
X-Spam-Level:
X-Spam-Status: No, score=0.0 required=3.0 tests=none autolearn=ham autolearn_force=no version=4.0.1
Path: nntp.eternal-september.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: Lawrence =?iso-8859-13?q?D=FFOliveiro?=
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Tue, 2 Sep 2025 22:56:12 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <1097smc$qe34$5@dont-email.me>
References: <10974d1$jn0e$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 02 Sep 2025 22:56:13 +0000 (UTC)
Injection-Info: dont-email.me; posting-host="bcaaf5f734cb0c4953edc4a46666cd0d";
	logging-data="866404"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX1/WrqMhC+61bHkNErp+QV5L"
User-Agent: Pan/0.163 (Kryvyi Rih)
Cancel-Lock: sha1:w7Lkz7V9IoKYtJm+tyJQ/DE/WOk=
Xref: feeder.eternal-september.org comp.lang.ada:67025 fr.comp.lang.ada:2361
List-Id:

On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:

> ... (UCS-4 has a number of additional differences from UTF-32
> regarding "valid encodings", namely that all valid Unicode
> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
> are valid in UTF-32) ...

So what do those codes mean in UCS-4?

> ... and are missing some additional information: a key detail is
> that even with UTF-32 where each Unicode scalar value is held in one
> array element rather than being variable-width like UTF-8/UTF-16,
> you still can't treat them as arbitrary arrays like 7-bit ASCII
> because a grapheme can be made up of multiple Unicode scalar values.
> Even with ASCII characters there's the possibility of combining
> diacritics or such that would break if you split the string between
> them.

This is why you have “normalization”.
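
To make the two quoted points concrete, here is a minimal sketch in Python (chosen only for brevity; it is not from either poster and says nothing about the Ada packages under discussion). The helper name is_scalar_value is hypothetical. It uses the ranges quoted above to separate scalar values from the surrogate code points, then shows one grapheme stored as two scalar values and what NFC normalization does to it.

```python
import unicodedata


def is_scalar_value(cp: int) -> bool:
    """Hypothetical helper: True if cp is a Unicode scalar value.

    Scalar values are the code points 0x0..0x10FFFF minus the
    surrogate range 0xD800..0xDFFF (the ranges quoted above).
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# One user-perceived character held in two array elements:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9,
# so this particular grapheme can no longer be split in the middle.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and restores the two-element form.
round_trip = unicodedata.normalize("NFD", composed)

print(is_scalar_value(0xD800), is_scalar_value(0xE000))
print(len(decomposed), len(composed), round_trip == decomposed)
```

Slicing decomposed after its first element would leave a bare "e" and an orphaned combining accent, which is the breakage described in the quoted text; after NFC the same text occupies a single element.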