From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 10:49:14 +0200

On 19/10/2016 10:15, G.B. wrote:
> On 18.10.16 22:03, Dmitry A. Kazakov wrote:
>> On 2016-10-18 19:35, G.B. wrote:
>>> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>>>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>>>
>>> Who would ever want to do that?
>>
>> Somebody claiming that UTF-8 string is a constrained subtype of
>> Latin-1 string.
>
> But I do not claim this!
>
> The misconception is to think that String is meant to be
> Latin-1 String. String isn't Latin-1 String. Ada states
> a *correspondence*, but no essence at all.

ARM 3.5.2:

"The predefined type Character is a character type whose values
correspond to the 256 code positions of Row 00 (also known as Latin-1)
of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

String means Latin-1. You can use it as if it meant something else, e.g.
a UTF-8 string or a UCS-2 string or PDP-11 machine code. That would prove
nothing except your willingness to go untyped.

> In fact, reading Japanese, or Polish, or Hebrew text would
> be impossible to do in Ada if String was Latin-1!

The Polish alphabet is Latin-based, BTW.

Yes, you need to break the type system in order to re-interpret String as
a UTF-8 string. You cannot do it in a typed way; that is the whole point.

Latin-1 and UTF-8 strings are not subtypes unless you break types. Once
you have done that, it no longer makes any sense to talk about subtypes.
A subtype presumes keeping, if not all (as an LSP subtype does), then at
least some of the vital properties. Latin-1 strings re-interpreted as
UTF-8 keep almost none of the string properties.

>>> To get a subset U from a set S, you apply a constraint
>>> to S. That's not (easily) expressible in Ada in this case.
>>
>> There is no such constraint at all. A-umlaut in Latin-1 is one
>> character, in UTF-8 it is two characters.
>
> In Ada, A-Umlaut is not a character in Latin-1,

It is. See ISO/IEC 8859-1.

> Where we would be needing conversion, were Ada to have
> types for character sets and so on, we now have operations
> such as Encode, Decode, and Convert.

Yep, Ada becomes an untyped mess. Again, it is not ill will to make C out
of Ada; it is merely that the Ada type system is too deficient to do it
properly. We cannot do it with generics or constrained subtypes, so we
drop typing to have at least something.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
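
The byte-level point argued in this post (A-umlaut is one octet in Latin-1 and two in UTF-8, so re-interpreting one as the other does not preserve the text) can be checked with a short sketch. It is written in Python rather than Ada purely for illustration; the helper names `encodings_of` and `reinterprets_as_utf8` are hypothetical and do not come from the thread or from any Ada library.

```python
def encodings_of(ch):
    """Return (Latin-1 octets, UTF-8 octets) for a single character."""
    return ch.encode("latin-1"), ch.encode("utf-8")


def reinterprets_as_utf8(data):
    """True if the octets happen to form valid UTF-8 when re-read as such."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A-umlaut: code position 0xC4 in Latin-1 (and U+00C4 in ISO/IEC 10646).
latin1, utf8 = encodings_of("\u00c4")

print(latin1)                        # b'\xc4'      (one octet)
print(utf8)                          # b'\xc3\x84'  (two octets)

# The Latin-1 octet alone is not valid UTF-8: 0xC4 is a lead byte
# that must be followed by a continuation byte.
print(reinterprets_as_utf8(latin1))  # False

# Only the 7-bit ASCII range is encoded identically in both.
print(reinterprets_as_utf8(b"A"))    # True
```

Going the other way, reading the two UTF-8 octets as Latin-1 yields two unrelated characters rather than A-umlaut, which is the loss of string properties described above.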