From: "G.B."
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 16:20:12 +0200

On 19.10.16 10:49, Dmitry A. Kazakov wrote:

>> The misconception is to think that String is meant to be
>> Latin-1 String. String isn't Latin-1 String. Ada states
>> a *correspondence*, but no essence at all.
>
> 3.5.2
>
> "The predefined type Character is a character type whose values
> correspond to the 256 code positions of Row 00 (also known as Latin-1)
  ^^^^^^^^^^
> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

Exactly, it means, values aren't Latin-1, they correspond to
Latin-1 code points. (To be /= To correspond to.)

>>>> To get a subset U from a set S, you apply a constraint
>>>> to S.
>>>> That's not (easily) expressible in Ada in this case.
>>>
>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>> character, in UTF-8 it is two characters.

A-Umlaut is a character, not a character-in-Some-Encoding-Form.
'€' is one, too, as are the four in "Łódź" that the man named
"Artiñano" (8 characters) could not manage to type into his
letter without accidentally spoiling his last name.

>> In Ada, A-Umlaut is not a character in Latin-1,
>
> It is. ISO/IEC 8859-1

For Ada, A-Umlaut is ("essence" vs "correspondence") not a character
in ISO/IEC 8859-1, but there exist correspondences between
A-Umlaut and the Ada Character and ISO/IEC 8859-1.

And we "cannot do it in a typed way, that is the whole point".

> it is merely a deficiency of Ada type system to do it properly.
> We cannot do it with generics or constrained subtypes, so we drop typing
> to have at least something.

Ada can add a constraining aspect to a type derived from String
so as to formally specify the set of values in that type.
In a way similar to

   type US_Elevator is new Integer range -10 .. 500
      with Static_Predicate => US_Elevator /= 13;

The short, informal name of that computable, exact specification
by a Predicate for the former type derived from String is "UTF-8".
It gives one-way substitutability: you can use a value of the derived
type wherever you can use a value of type String, if there ever
is a need for doing so (e.g. dumb String'Write can be reused
after Convert-ing to UTF_8_String (encoding)).
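
A minimal sketch of what such a predicate-constrained type might look
like in Ada 2012. Nothing here is from the post or from the standard
library: the names Checked_UTF_8_String, Is_Well_Formed and Try are
hypothetical, chosen for illustration (the standard's own
Ada.Strings.UTF_Encoding.UTF_8_String is a plain subtype of String with
no predicate). The validity test follows the usual well-formedness rules
for UTF-8 (no overlong forms, no surrogates, nothing above 16#10_FFFF#),
and Assertion_Policy is set to Check because predicate checks may
otherwise be disabled.

```ada
with Ada.Assertions;
with Ada.Text_IO;

procedure UTF_8_Sketch is

   pragma Assertion_Policy (Check);  --  make sure predicates are checked

   --  Hypothetical helper: True if the Characters of S, read as
   --  octets, form a well-formed UTF-8 sequence.
   function Is_Well_Formed (S : String) return Boolean is
      I     : Integer := S'First;
      B     : Natural;  --  current octet
      Extra : Natural;  --  continuation octets expected after B
      Code  : Natural;  --  code point being assembled
      Min   : Natural;  --  smallest code point allowed for this length
   begin
      while I <= S'Last loop
         B := Character'Pos (S (I));
         if B < 16#80# then
            Extra := 0; Code := B; Min := 0;
         elsif B in 16#C2# .. 16#DF# then
            Extra := 1; Code := B - 16#C0#; Min := 16#80#;
         elsif B in 16#E0# .. 16#EF# then
            Extra := 2; Code := B - 16#E0#; Min := 16#800#;
         elsif B in 16#F0# .. 16#F4# then
            Extra := 3; Code := B - 16#F0#; Min := 16#1_0000#;
         else
            return False;  --  stray continuation or invalid lead octet
         end if;

         if Extra > S'Last - I then
            return False;  --  sequence is cut off at the end of S
         end if;

         for K in 1 .. Extra loop
            B := Character'Pos (S (I + K));
            if B not in 16#80# .. 16#BF# then
               return False;
            end if;
            Code := Code * 64 + (B - 16#80#);
         end loop;

         if Code < Min
           or else Code > 16#10_FFFF#
           or else Code in 16#D800# .. 16#DFFF#
         then
            return False;  --  overlong, out of range, or surrogate
         end if;

         I := I + Extra + 1;
      end loop;
      return True;
   end Is_Well_Formed;

   --  The set of values is specified by the predicate, in the same
   --  spirit as the US_Elevator example.
   type Checked_UTF_8_String is new String
      with Dynamic_Predicate =>
         Is_Well_Formed (String (Checked_UTF_8_String));

   --  Hypothetical driver: converting to the derived type triggers the
   --  predicate check.
   procedure Try (Label : String; S : String) is
   begin
      declare
         U : constant Checked_UTF_8_String := Checked_UTF_8_String (S);
      begin
         Ada.Text_IO.Put_Line
           (Label & ": accepted," & Integer'Image (U'Length)
            & " code unit(s)");
      end;
   exception
      when Ada.Assertions.Assertion_Error =>
         Ada.Text_IO.Put_Line (Label & ": rejected by predicate");
   end Try;

begin
   Try ("Latin-1 A-umlaut", (1 => Character'Val (16#C4#)));
   Try ("UTF-8 A-umlaut", (Character'Val (16#C3#), Character'Val (16#84#)));
end UTF_8_Sketch;
```

With checks enabled, the single octet 16#C4# should be rejected and the
pair 16#C3# 16#84# accepted. Going the other way needs only a plain
conversion, String (U), which involves no predicate. That is the one-way
substitutability described above.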