From mboxrd@z Thu Jan  1 00:00:00 1970
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "G.B." <bauhaus@futureapps.invalid>
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Wed, 19 Oct 2016 16:20:12 +0200
Organization: A noiseless patient Spider
Message-ID: <nu7ve9$r44$1@dont-email.me>
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
 <nu3mkc$agg$1@dont-email.me> <nu4jnj$11va$1@gioia.aioe.org>
 <nu4m5k$g7g$1@dont-email.me> <nu4nee$18le$1@gioia.aioe.org>
 <nu4sbm$4m3$1@dont-email.me> <nu54af$1oo$1@gioia.aioe.org>
 <nu5e0p$54t$1@dont-email.me> <nu5j0s$sch$1@gioia.aioe.org>
 <nu5mgi$7er$1@dont-email.me> <nu5v60$1h81$1@gioia.aioe.org>
 <nu7a3b$i94$1@dont-email.me> <nu7c29$1dsi$1@gioia.aioe.org>
Reply-To: nonlegitur@futureapps.de
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 19 Oct 2016 14:19:53 -0000 (UTC)
Injection-Info: mx02.eternal-september.org;
 posting-host="5ab09b343b60297e09d9d0458c5b067e";
	logging-data="27780"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX186b+angWQdzJJRuIdM407SPeqIUpgkfow="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.4.0
In-Reply-To: <nu7c29$1dsi$1@gioia.aioe.org>
Cancel-Lock: sha1:qekg7t44500zR9neRxhDpZ7CLd0=
Xref: news.eternal-september.org comp.lang.ada:32136
List-Id: <comp.lang.ada>

On 19.10.16 10:49, Dmitry A. Kazakov wrote:

>> The misconception is to think that String is meant to be
>> Latin-1 String. String isn't Latin-1 String. Ada states
>> a *correspondence*, but no essence at all.
>
> 3.5.2
>
> "The predefined type Character is a character type whose values
> correspond to the 256 code positions of Row 00 (also known as Latin-1)
   ^^^^^^^^^^
> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

Exactly: it means the values aren't Latin-1; they correspond
to Latin-1 code points. (To be /= to correspond to.)


>>>> To get a subset U from a set S, you apply a constraint
>>>> to S. That's not (easily) expressible in Ada in this case.
>>>
>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>> character, in UTF-8 it is two characters.

A-Umlaut is a character, not a character-in-Some-Encoding-Form.
'€' is one, too, as are the four in "Łódź" that the man named
"Artiñano" (8 characters) could not manage to type into his letter
without accidentally spoiling his last name.
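
To make the character vs. encoding-form distinction concrete, here is a
minimal sketch, assuming an Ada 2012 compiler that provides
Ada.Strings.UTF_Encoding. The procedure and object names are made up
for illustration; the ñ is written as a code point value so that the
source text itself stays plain ASCII.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Characters_Vs_Bytes is

   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  U+00F1, LATIN SMALL LETTER N WITH TILDE
   N_Tilde : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#F1#);

   --  The name as a sequence of characters: 4 + 1 + 3 of them.
   Name : constant Wide_Wide_String := "Arti" & N_Tilde & "ano";

   --  The same name as a sequence of UTF-8 code units.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     UTF.Encode (Name);

begin
   Put_Line (Integer'Image (Name'Length));     --  prints " 8"
   Put_Line (Integer'Image (Encoded'Length));  --  prints " 9"
end Characters_Vs_Bytes;
```

The count of characters does not change with the encoding form; only
the count of code units does.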


>> In Ada, A-Umlaut is not a character in Latin-1,
>
> It is. ISO/IEC 8859-1

For Ada ("essence" vs. "correspondence"), A-Umlaut is not
a character in ISO/IEC 8859-1; rather, there exist correspondences
between A-Umlaut, the Ada type Character, and ISO/IEC 8859-1.
And we "cannot do it in a typed way, that is the whole point".

> it is merely a deficiency of Ada type system to do it properly.
> We cannot do it with generics or constrained subtypes, so we drop typing
> to have at least something.

Ada can add a constraining aspect to a type derived from String
so as to formally specify the set of values in that type.
In a way similar to

   type US_Elevator is new Integer range -10 .. 500
      with
        Static_Predicate => US_Elevator /= 13;

The short, informal name of that computable, exact specification,
given by a predicate on the type derived from String, is "UTF-8".

It gives one-way substitutability: you can use a value of the
derived type wherever you can use a value of type String, should
there ever be a need to do so (e.g. a dumb String'Write can be
reused after Convert-ing to UTF_8_String (encoding)).
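
A rough sketch of what such a predicate could look like, assuming
Ada 2012. The names UTF_8_Text and Is_UTF_8 are hypothetical, and the
check is deliberately simplified: it looks only at lead bytes and
continuation bytes, so it still lets through some overlong forms and
surrogates that a complete UTF-8 validator would reject.

```ada
with Ada.Assertions;
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF_8_Predicate_Demo is

   pragma Assertion_Policy (Check);  --  make sure predicates are checked

   --  Hypothetical helper: structural check of lead and continuation
   --  bytes only (simplified, see the remark above).
   function Is_UTF_8 (S : String) return Boolean is
      I    : Integer := S'First;
      Code : Natural;
      Tail : Natural;  --  continuation bytes expected after the lead
   begin
      while I <= S'Last loop
         Code := Character'Pos (S (I));
         case Code is
            when 16#00# .. 16#7F# => Tail := 0;
            when 16#C2# .. 16#DF# => Tail := 1;
            when 16#E0# .. 16#EF# => Tail := 2;
            when 16#F0# .. 16#F4# => Tail := 3;
            when others           => return False;
         end case;
         if Tail > S'Last - I then
            return False;  --  sequence is cut off at the end
         end if;
         for K in I + 1 .. I + Tail loop
            if Character'Pos (S (K)) not in 16#80# .. 16#BF# then
               return False;
            end if;
         end loop;
         I := I + Tail + 1;
      end loop;
      return True;
   end Is_UTF_8;

   --  The set of values is String constrained by the predicate.
   type UTF_8_Text is new String
     with Dynamic_Predicate => Is_UTF_8 (String (UTF_8_Text));

   --  A-Umlaut: one octet C4 in Latin-1, two octets C3 84 in UTF-8.
   Latin_1_Bytes : constant String := (1 => Character'Val (16#C4#));
   UTF_8_Bytes   : constant String :=
     Character'Val (16#C3#) & Character'Val (16#84#);

begin
   Put_Line (Boolean'Image (Is_UTF_8 (Latin_1_Bytes)));  --  FALSE
   Put_Line (Boolean'Image (Is_UTF_8 (UTF_8_Bytes)));    --  TRUE

   declare
      Good : constant UTF_8_Text := UTF_8_Text (UTF_8_Bytes);
   begin
      --  One way: a UTF_8_Text value can always be viewed as a String.
      Put_Line (String (Good));
   end;

   begin
      declare
         --  The other way is checked: this conversion fails.
         Bad : constant UTF_8_Text := UTF_8_Text (Latin_1_Bytes);
      begin
         Put_Line (String (Bad));  --  not reached
      end;
   exception
      when Ada.Assertions.Assertion_Error =>
         Put_Line ("rejected: not UTF-8");
   end;
end UTF_8_Predicate_Demo;
```

Since the type is derived, the view as String is an explicit
conversion, String (Good), and it always succeeds. The conversion in
the other direction is where the predicate gets checked.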