From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 18:20:47 +0200
Organization: Aioe.org NNTP Server
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>

On 2016-10-19 16:20, G.B. wrote:
> On 19.10.16 10:49, Dmitry A. Kazakov wrote:
>
>>> The misconception is to think that String is meant to be
>>> Latin-1 String. String isn't Latin-1 String. Ada states
>>> a *correspondence*, but no essence at all.
>>
>> 3.5.2
>>
>> "The predefined type Character is a character type whose values
>> correspond to the 256 code positions of Row 00 (also known as Latin-1)
>  ^^^^^^^^^^
>> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."
>
> Exactly, it means, values aren't Latin-1, they correspond
> to Latin-1 code points. (To be /= To correspond to.)

They are. The language is necessarily sloppy for the sake of simplicity.
Values are Latin-1. The corresponding language character objects (which
are customarily called "values" too) correspond to, i.e. represent,
these values.
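
To make the correspondence concrete, here is a minimal sketch that takes A-umlaut as one Character value and compares its Latin-1 and UTF-8 representations. It assumes an Ada 2012 compiler providing Ada.Strings.UTF_Encoding.Strings; the procedure name A_Umlaut_Demo and the constant names are made up for illustration only.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure A_Umlaut_Demo is
   use Ada.Text_IO;
   package UTF renames Ada.Strings.UTF_Encoding;

   --  Position 16#C4# of Row 00 is LATIN CAPITAL LETTER A WITH DIAERESIS.
   A_Umlaut : constant Character := Character'Val (16#C4#);

   --  One Character when the String is taken as Latin-1 ...
   Latin_1 : constant String := (1 => A_Umlaut);

   --  ... and two code units (16#C3#, 16#84#) once encoded as UTF-8.
   UTF_8 : constant UTF.UTF_8_String := UTF.Strings.Encode (Latin_1);
begin
   Put_Line ("Latin-1 length:" & Natural'Image (Latin_1'Length));
   Put_Line ("UTF-8 length:  " & Natural'Image (UTF_8'Length));

   --  UTF_8_String is a subtype of String, so "=" compares the objects
   --  themselves, not the characters they are meant to represent.
   Put_Line ("Objects equal:   " & Boolean'Image (Latin_1 = UTF_8));

   --  Only after decoding do both denote the same character sequence.
   Put_Line ("Characters equal: "
             & Boolean'Image (UTF.Strings.Decode (UTF_8) = Latin_1));
end A_Umlaut_Demo;
```

The compiler accepts both comparisons without complaint because UTF_8_String is declared as a subtype of String; nothing in the type system records which encoding a given object uses.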
There is no reason to distinguish language values and the problem space
values they represent so long as there is no confusion.

Anyway it does not change anything in the discussion. The same objects
of String and of UTF-8 String correspond to (represent) different
problem space values. Sameness is defined as equality "=".

>>>>> To get a subset U from a set S, you apply a constraint
>>>>> to S. That's not (easily) expressible in Ada in this case.
>>>>
>>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>>> character, in UTF-8 it is two characters.
>
> A-Umlaut is a character, not a character-in-Some-Encoding-Form.

The text you quote states exactly that.

>>> In Ada, A-Umlaut is not a character in Latin-1,
>>
>> It is. ISO/IEC 8859-1
>
> For Ada, A-Umlaut is ("essence" vs "correspondence") not
> a character in ISO/IEC 8859-1, but there exist correspondences
> between A-Umlaut and the Ada Character and ISO/IEC 8859-1.

Ada character objects represent characters defined in ISO/IEC 8859-1.
For each object there is one and only one ISO/IEC 8859-1 character and,
conversely, for each ISO/IEC 8859-1 character there is one and only one
Ada character value.

> And we "cannot do it in a typed way, that is the whole point".
>
>> it is merely a deficiency of Ada type system to do it properly.
>> We cannot do it with generics or constrained subtypes, so we drop typing
>> to have at least something.
>
> Ada can add a constraining aspect to a type derived from String
> so as to formally specify the set of values in that type.

That won't be a string subtype, a property considered more important
than being a proper subtype. There is no language subtype that could
represent a subtype relationship between sequences of the *same*
characters having *different* encodings (representations). Which is the
essence of the problem.

> The short, informal name of that computable, exact specification
> by a Predicate for the former type derived from String is "UTF-8".
>
> It gives one-way substitutability: you can use a value of the
> derived type wherever you can use a value of type String, if
> there ever is a need for doing so (e.g. dumb String'Write can be
> reused after Convert-ing to UTF_8_String (encoding)).

See what the example with A-umlaut illustrates.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de