From mboxrd@z Thu Jan  1 00:00:00 1970
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 10:49:14 +0200
Organization: Aioe.org NNTP Server
Message-ID: <nu7c29$1dsi$1@gioia.aioe.org>
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
 <nu3mkc$agg$1@dont-email.me> <nu4jnj$11va$1@gioia.aioe.org>
 <nu4m5k$g7g$1@dont-email.me> <nu4nee$18le$1@gioia.aioe.org>
 <nu4sbm$4m3$1@dont-email.me> <nu54af$1oo$1@gioia.aioe.org>
 <nu5e0p$54t$1@dont-email.me> <nu5j0s$sch$1@gioia.aioe.org>
 <nu5mgi$7er$1@dont-email.me> <nu5v60$1h81$1@gioia.aioe.org>
 <nu7a3b$i94$1@dont-email.me>
NNTP-Posting-Host: vZYCW951TbFitc4GdEwQJg.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:32133
Date: 2016-10-19T10:49:14+02:00
List-Id: <comp.lang.ada>

On 19/10/2016 10:15, G.B. wrote:
> On 18.10.16 22:03, Dmitry A. Kazakov wrote:
>> On 2016-10-18 19:35, G.B. wrote:
>>> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>>>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>>>
>>> Who would ever want to do that?
>>
>> Somebody claiming that UTF-8 string is a constrained subtype of
>> Latin-1 string.
>
> But I do not claim this!
>
> The misconception is to think that String is meant to be
> Latin-1 String. String isn't Latin-1 String. Ada states
> a *correspondence*, but no essence at all.

3.5.2

"The predefined type Character is a character type whose values 
correspond to the 256 code positions of Row 00 (also known as Latin-1) 
of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

String means Latin-1. You can use it as if it meant something else, e.g.
a UTF-8 string, a UCS-2 string, or PDP-11 machine code. That would prove
nothing except your willingness to go untyped.

> In fact, reading Japanese, or Polish, or Hebrew text would
> be impossible to do in Ada if String was Latin-1!

The Polish alphabet is Latin-based, BTW.

Yes, you need to break the type system in order to re-interpret String
as a UTF-8 string. You cannot do it in a typed way; that is the whole
point. Latin-1 and UTF-8 strings are not subtypes of one another unless
you break types. Once you have done that, it no longer makes sense to
talk about subtypes. A subtype presumes keeping, if not all (LSP
subtype), then at least some of the vital properties. Latin-1 strings
re-interpreted as UTF-8 keep almost none of the string properties.
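
To make the lost properties concrete, here is a minimal sketch, assuming
an Ada 2012 compiler that provides Ada.Strings.UTF_Encoding. The
procedure name Reinterpret_Demo is made up for illustration. It encodes
a one-character Latin-1 string as UTF-8 and then looks at the result as
if it were still Latin-1:

```ada
with Ada.Text_IO;                       use Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure Reinterpret_Demo is
   package Latin_1 renames Ada.Characters.Latin_1;

   --  One Latin-1 character: A-umlaut, code position 196.
   Latin : constant String := (1 => Latin_1.UC_A_Diaeresis);

   --  The same text encoded as UTF-8 (no BOM by default).
   UTF8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin);
begin
   --  Length is not preserved: one element becomes two.
   Put_Line ("Latin-1 length:" & Integer'Image (Latin'Length));
   Put_Line ("UTF-8 length:  " & Integer'Image (UTF8'Length));

   --  Element meaning is not preserved either: read as Latin-1,
   --  the first element (195) is A-tilde, not A-umlaut.
   if UTF8 (UTF8'First) = Latin_1.UC_A_Tilde then
      Put_Line ("First element reads as Latin-1 A-tilde");
   end if;
end Reinterpret_Demo;
```

The lengths come out as 1 and 2, and indexing the encoded string yields
octets 195 and 132, neither of which is A-umlaut in Latin-1.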

>>> To get a subset U from a set S, you apply a constraint
>>> to S. That's not (easily) expressible in Ada in this case.
>>
>> There is no such constraint at all. A-umlaut in Latin-1 is one
>> character, in UTF-8 it is two characters.
>
> In Ada, A-Umlaut is not a character in Latin-1,

It is. ISO/IEC 8859-1

> Where we would be needing conversion, were Ada to have
> types for character sets and so on, we now have operations
> such as Encode, Decode, and Convert.

Yep, Ada becomes an untyped mess. Again, it is not ill will to make C
out of Ada; it is merely that the Ada type system is too deficient to do
it properly. We cannot do it with generics or constrained subtypes, so
we drop typing to have at least something.
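
A small sketch of what "dropping typing" looks like in practice, again
assuming Ada 2012. Print_Latin_1 and Untyped_Demo are hypothetical names
used only for illustration. Since Ada.Strings.UTF_Encoding declares
UTF_8_String as a plain subtype of String, a routine written for Latin-1
text accepts encoded text without any complaint from the compiler:

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure Untyped_Demo is

   --  Hypothetical routine that assumes its argument is Latin-1 text.
   procedure Print_Latin_1 (Text : String) is
   begin
      Ada.Text_IO.Put_Line
        ("Characters:" & Integer'Image (Text'Length));
   end Print_Latin_1;

   Latin : constant String :=
     (1 => Ada.Characters.Latin_1.UC_A_Diaeresis);

   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin);
begin
   Print_Latin_1 (Latin);    --  counts 1, as intended
   --  Also legal: UTF_8_String is "subtype UTF_8_String is String",
   --  so no type check can reject this call. It counts 2.
   Print_Latin_1 (Encoded);
end Untyped_Demo;
```

Both calls compile; only the programmer knows that the second count is
octets rather than characters.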

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de