From: "G.B."
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 10:15:58 +0200
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>

On 18.10.16 22:03, Dmitry A. Kazakov wrote:
> On 2016-10-18 19:35, G.B. wrote:
>> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>>
>> Who would ever want to do that?
>
> Somebody claiming that UTF-8 string is a constrained subtype of
> Latin-1 string.

But I do not claim this!

The misconception is to think that String is meant to be
Latin-1 String. String isn't Latin-1 String. Ada states
a *correspondence*, but no essence at all. In fact, reading
Japanese, or Polish, or Hebrew text would be impossible to do
in Ada if String was Latin-1!

Yes, character sets in Ada do not have types.

>> To get a subset U from a set S, you apply a constraint
>> to S. That's not (easily) expressible in Ada in this case.
>
> There is no such constraint at all. A-umlaut in Latin-1 is one
> character, in UTF-8 it is two characters.

In Ada, A-Umlaut is not a character in Latin-1.
In Ada, A-Umlaut is not a character in UTF-8.
Reason: Latin-1 and UTF-8 describe encoded forms, as do
KOI8-R, ISO-8859-15, Shift_JIS, or CP 1252. Some also happen
to list, and some merely indicate, a repertoire of
corresponding characters.
A-Umlaut is a character, with a lower-case "c".

> To introduce a subtype relationship we need a conversion, not a
> constraint. Ada does not support this method of subtype construction.

An Ada subtype relationship is designed to avoid conversion,
and so a subtype is distinguishable by its constraint, and its
name, only.

Where we would be needing conversion, were Ada to have types
for character sets and so on, we now have operations such as
Encode, Decode, and Convert, together with statements of
correspondence and normative references in the RM.

But neither prevents identifying a subset of valid values
of dumb type String that constitutes the subset of
UTF_8_String. Or that of a to-be-defined (trivial) subtype
Latin_1_String.

   subtype Latin_String is String;  -- RM blah blah
   ...
   subtype Latin_1_String is String;

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff
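
The point above about identifying the subset of String values
that are valid UTF-8 might be sketched in Ada 2012 roughly as
follows. This is only an illustration under stated assumptions,
not something the RM defines: the names Subset_Demo,
Is_Well_Formed_UTF_8 and Checked_UTF_8_String are made up for
this example, and "valid" is taken to mean "accepted by the
standard Decode function without raising Encoding_Error".

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Subset_Demo is
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper (not in the RM): True if S is accepted
   --  by the standard UTF-8 decoder.
   function Is_Well_Formed_UTF_8 (S : String) return Boolean is
   begin
      declare
         Decoded : constant Wide_Wide_String :=
           Wide_Wide_Strings.Decode (S);
      begin
         --  Well-formed UTF-8 never has more code points than
         --  bytes.  Using the result also ensures the call to
         --  Decode (declared in a Pure package) is not omitted.
         return Decoded'Length <= S'Length;
      end;
   exception
      when Encoding_Error =>
         return False;
   end Is_Well_Formed_UTF_8;

   --  Hypothetical subtype: same type String, no conversion,
   --  only a predicate that picks out a subset of its values.
   subtype Checked_UTF_8_String is String
     with Dynamic_Predicate =>
       Is_Well_Formed_UTF_8 (Checked_UTF_8_String);

   --  The two octets C3 84 are the UTF-8 encoded form of A-Umlaut.
   Two_Octets : constant String :=
     Character'Val (16#C3#) & Character'Val (16#84#);

   --  The single octet C4 is the Latin-1 encoded form of A-Umlaut.
   One_Octet : constant String := (1 => Character'Val (16#C4#));
begin
   --  Membership tests evaluate the predicate whatever the
   --  assertion policy in effect.
   Ada.Text_IO.Put_Line
     ("C3 84 in subset: "
      & Boolean'Image (Two_Octets in Checked_UTF_8_String));
   Ada.Text_IO.Put_Line
     ("C4 in subset:    "
      & Boolean'Image (One_Octet in Checked_UTF_8_String));
end Subset_Demo;
```

Both constants are plain String values. The predicate only
answers whether a given value belongs to the subset; nothing is
converted, which is the distinction drawn above between a
subtype and operations such as Encode and Decode.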