From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 18:20:47 +0200
Organization: Aioe.org NNTP Server
Message-ID: <nu86gv$ufu$1@gioia.aioe.org>
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
 <nu3mkc$agg$1@dont-email.me> <nu4jnj$11va$1@gioia.aioe.org>
 <nu4m5k$g7g$1@dont-email.me> <nu4nee$18le$1@gioia.aioe.org>
 <nu4sbm$4m3$1@dont-email.me> <nu54af$1oo$1@gioia.aioe.org>
 <nu5e0p$54t$1@dont-email.me> <nu5j0s$sch$1@gioia.aioe.org>
 <nu5mgi$7er$1@dont-email.me> <nu5v60$1h81$1@gioia.aioe.org>
 <nu7a3b$i94$1@dont-email.me> <nu7c29$1dsi$1@gioia.aioe.org>
 <nu7ve9$r44$1@dont-email.me>
NNTP-Posting-Host: XXXaKfQ6zzC8DMOzOT/pgA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:32139
Date: 2016-10-19T18:20:47+02:00
List-Id: <comp.lang.ada>

On 2016-10-19 16:20, G.B. wrote:
> On 19.10.16 10:49, Dmitry A. Kazakov wrote:
>
>>> The misconception is to think that String is meant to be
>>> Latin-1 String. String isn't Latin-1 String. Ada states
>>> a *correspondence*, but no essence at all.
>>
>> 3.5.2
>>
>> "The predefined type Character is a character type whose values
>> correspond to the 256 code positions of Row 00 (also known as Latin-1)
>   ^^^^^^^^^^
>> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."
>
> Exactly, it means, values aren't Latin-1, they correspond
> to Latin-1 code points. (To be /= To correspond to.)

They are. The language is necessarily sloppy for the sake of simplicity.
Values are Latin-1. The corresponding language character objects (which
are customarily called "values" too) correspond to, that is represent,
these values. There is no reason to distinguish language values from the
problem-space values they represent so long as there is no confusion.

Anyway, it does not change anything in the discussion. The same objects
of String and UTF-8 String correspond to (represent) different
problem-space values. Sameness is defined as equality "=".
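
A minimal sketch of that point, in Python rather than Ada only because
it is easy to run; the variable names are made up for illustration.
The same two octets, equal under "=", represent two characters when
read as Latin-1 and one character when read as UTF-8:

```python
# The same two octets, i.e. objects that compare equal with "=".
octets = bytes([0xC3, 0x84])

# Interpreted as Latin-1: two characters, U+00C3 and U+0084.
as_latin_1 = octets.decode("latin-1")

# Interpreted as UTF-8: one character, U+00C4 (A-umlaut).
as_utf_8 = octets.decode("utf-8")

print([hex(ord(c)) for c in as_latin_1])
print([hex(ord(c)) for c in as_utf_8])
```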

>>>>> To get a subset U from a set S, you apply a constraint
>>>>> to S. That's not (easily) expressible in Ada in this case.
>>>>
>>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>>> character, in UTF-8 it is two characters.
>
> A-Umlaut is a character, not a character-in-Some-Encoding-Form.

The text you quote states exactly that.

>>> In Ada, A-Umlaut is not a character in Latin-1,
>>
>> It is. ISO/IEC 8859-1
>
> For Ada, A-Umlaut is ("essence" vs "correspondence") not
> a character in ISO/IEC 8859-1, but there exist correspondences
> between A-Umlaut and the Ada Character and ISO/IEC 8859-1.

Ada character objects represent characters defined in ISO/IEC 8859-1.
For each object there is one and only one ISO/IEC 8859-1 character and,
conversely, for each ISO/IEC 8859-1 character there is one and only one
Ada character value.
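
As a rough illustration of that one-to-one correspondence (again in
Python, not Ada), the sketch below assumes Python's "latin-1" codec,
which maps all 256 octet values to code positions 0..255 of Row 00,
control positions included. The names are hypothetical:

```python
# All 256 code positions of Row 00.
code_positions = list(range(256))

# Each position yields exactly one character under the "latin-1" codec.
characters = [bytes([n]).decode("latin-1") for n in code_positions]

# Going back from character to position recovers the original position.
round_trip = [c.encode("latin-1")[0] for c in characters]

# No two positions share a character, so the mapping is one-to-one.
distinct = len(set(characters))
```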

> And we "cannot do it in a typed way, that is the whole point".
>
>> it is merely a deficiency of Ada type system to do it properly.
>> We cannot do it with generics or constrained subtypes, so we drop typing
>> to have at least something.
>
> Ada can add a constraining aspect to a type derived from String
> so as to formally specify the set of values in that type.

That won't be a subtype of String, a property that was considered more
important than being a proper subtype.

There is no language subtype that could represent a subtype relationship
between sequences of the *same* characters having *different* encodings
(representations). That is the essence of the problem.
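
A small sketch of why no constraint on the octets can work, assuming
Python's standard codecs; is_valid_utf_8 is a helper invented for this
example. The same character, A-umlaut, has different representations,
and its Latin-1 representation is not even well-formed UTF-8:

```python
a_umlaut = "\u00c4"

latin_1 = a_umlaut.encode("latin-1")  # one octet: C4
utf_8 = a_umlaut.encode("utf-8")      # two octets: C3 84


def is_valid_utf_8(octets: bytes) -> bool:
    """Return True if the octets decode as well-formed UTF-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin_1.hex(), is_valid_utf_8(latin_1))
print(utf_8.hex(), is_valid_utf_8(utf_8))
```

Only the 7-bit (ASCII) range has the same representation in both
encodings.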

> The short, informal name of that computable, exact specification
> by a Predicate for the former type derived from String is "UTF-8".
>
> It gives one-way substitutability: you can use a value of the
> derived type wherever you can use a value of type String, if
> there ever is a need for doing so (e.g. dumb String'Write can be
> reused after Convert-ing to UTF_8_String (encoding)).

See what the A-umlaut example above illustrates.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de