From mboxrd@z Thu Jan  1 00:00:00 1970
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Tue, 18 Oct 2016 18:35:42 +0200
Organization: Aioe.org NNTP Server
Message-ID: <nu5j0s$sch$1@gioia.aioe.org>
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
 <nu3mkc$agg$1@dont-email.me> <nu4jnj$11va$1@gioia.aioe.org>
 <nu4m5k$g7g$1@dont-email.me> <nu4nee$18le$1@gioia.aioe.org>
 <nu4sbm$4m3$1@dont-email.me> <nu54af$1oo$1@gioia.aioe.org>
 <nu5e0p$54t$1@dont-email.me>
NNTP-Posting-Host: XXXaKfQ6zzC8DMOzOT/pgA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:32124
Date: 2016-10-18T18:35:42+02:00
List-Id: <comp.lang.ada>

On 2016-10-18 17:10, G.B. wrote:
> On 18.10.16 14:24, Dmitry A. Kazakov wrote:
>
>>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>>> and it satisfies a predicate that involves all components x, x', x'',
>>> ...
>>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>>> then x' is such-and-such, and so on. I'm not sure this predicate
>>> is easily stated as a stand-alone type invariant, for example, but
>>> that's the idea. It shouldn't have to be visible to Ada programmers.
>>
>> Sorry, that is a meaningless set of words.
>
> Spelling out the look of model strings for type UTF_8_String
> cannot quite be meaningless.
>
> Ada Rationale:
>
> "Type invariants are designed for use with private types
>  where we want some relationship to always hold between
>  components of the type". (2.4)

That is completely irrelevant. No invariant can make Latin-1 A-umlaut 
UTF-8 A-umlaut.
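
As a minimal sketch of the point (Python is used here purely for 
illustration and is not part of the original discussion; the helper name 
is_valid_utf8 is hypothetical), the same character yields different 
octets in the two encodings, and the Latin-1 octet on its own is not 
well-formed UTF-8:

```python
# Illustrative sketch: octets of A-umlaut (U+00C4) in Latin-1 versus UTF-8.
a_umlaut = "\u00c4"

latin1_octets = a_umlaut.encode("latin-1")  # a single octet, C4
utf8_octets = a_umlaut.encode("utf-8")      # two octets, C3 84


def is_valid_utf8(octets):
    """Return True if the octets decode as well-formed UTF-8 (hypothetical helper)."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_octets.hex(), utf8_octets.hex())
print(is_valid_utf8(latin1_octets), is_valid_utf8(utf8_octets))
```

No predicate over the octets changes which octets are there; getting 
from one form to the other takes a conversion.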

>> Type constraint is put on type values.
>
> (Type values or a type's values?)

Values of a type, e.g. Positive is a constrained Integer.

> UTF_8_String does identify a subset of the values of
> type String, by intent,

No, it does not; that is why this implementation is broken. UTF-8 
strings can be represented by String, just as they can be represented by 
Boolean arrays, by indefinite integers, or by polygons. That does not 
make them a subtype of Boolean array. No way.
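
A rough sketch of the distinction between representation and subtype, 
again in Python for illustration only (to_bits and from_bits are 
hypothetical helpers, not anything from Ada or its library): UTF-8 
octets survive a round trip through a list of Booleans, yet reading the 
same octets as Latin-1 gives a different value.

```python
# Sketch: a UTF-8 string stored in a Boolean array, then read back.
def to_bits(octets):
    """Expand each octet into eight Booleans, most significant bit first."""
    return [bool((o >> (7 - i)) & 1) for o in octets for i in range(8)]


def from_bits(bits):
    """Pack groups of eight Booleans back into octets."""
    out = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | int(bit)
        out.append(value)
    return bytes(out)


text = "\u00c4"                               # A-umlaut
utf8_octets = text.encode("utf-8")            # C3 84
bits = to_bits(utf8_octets)                   # 16 Booleans
roundtrip = from_bits(bits).decode("utf-8")   # the original text again

# The same octets taken as Latin-1 denote two other characters.
as_latin1 = utf8_octets.decode("latin-1")
print(roundtrip == text, len(as_latin1))
```

The Boolean list carries the text faithfully, but nobody would call the 
text a value of a Boolean array subtype; the same reasoning applies to 
String.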

>> Values of UTF-8 strings are not values of strings, as A-umlaut
>> promptly demonstrates. Period.
>
> Of course they can be (ASCII subset).

A-umlaut is not ASCII.

> But this is not otherwise reflected in the subtype, AFAICS.
> Considering
>
>   UTF_String'("'Ä' is A-Umlaut");
>
> the literal, if taken at face value, doesn't say which characters
> there are going to be.

It does exactly this, once you define "character".

> It takes an Ada compiler to interpret the
> source text and decide whether it is representing Latin-1 or
> a multi-octet sequence, possibly one that needs Wide_Character
> or Wide_Wide_Character.

There is nothing to interpret if literals are considered to be of 
Universal_String. It is no different from the way Universal_Integer is 
handled. String, UTF-8 string, and Wide string can be considered 
subtypes of Universal_String; that has no effect on the relationship 
between String and UTF-8 string. The same holds if literals are 
considered overloaded functions. No different.

>> "Remainders are values ... in Character" makes no sense either.
>
>   Character'Val (N rem 256);

So what? Numeric characters are still a constrained subtype of the 
character type.
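
A small sketch of what a constrained subtype amounts to, in Python for 
illustration only. The names character_val and is_digit are 
hypothetical, and the analogue of Character'Val (N rem 256) is limited 
to non-negative N because Python's % and Ada's rem differ for negative 
operands:

```python
# Sketch: the digits as a constrained range within the 256 Character values.
def character_val(n):
    """Rough analogue of Character'Val (N rem 256), non-negative n only."""
    if n < 0:
        raise ValueError("this sketch only covers non-negative n")
    return chr(n % 256)


def is_digit(c):
    """Membership test for the range '0' .. '9'."""
    return "0" <= c <= "9"


all_characters = [character_val(n) for n in range(256)]
digits = [c for c in all_characters if is_digit(c)]

print("".join(digits))
print(all(c in all_characters for c in digits))
```

Every digit is one of the 256 values, and the constraint only narrows 
the set; it does not re-encode anything.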

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de