comp.lang.ada
From: "G.B." <bauhaus@futureapps.invalid>
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Tue, 18 Oct 2016 17:10:35 +0200
Message-ID: <nu5e0p$54t$1@dont-email.me>
In-Reply-To: <nu54af$1oo$1@gioia.aioe.org>

On 18.10.16 14:24, Dmitry A. Kazakov wrote:

>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>> and it satisfies a predicate that involves all components x, x', x'', ...
>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>> then x' is such-and-such, and so on. I'm not sure this predicate
>> is easily stated as a stand-alone type invariant, for example, but
>> that's the idea. It shouldn't have to be visible to Ada programmers.
>
> Sorry, that is a meaningless set of words.

Spelling out the look of model strings for type UTF_8_String
cannot quite be meaningless.

Ada Rationale:

"Type invariants are designed for use with private types
  where we want some relationship to always hold between
  components of the type". (2.4)

We want some relationship to always hold between components
of type UTF_8_String (if it were private, so that 2.4 might
formally apply):

   type UTF_Rep_Text is ... with Type_Invariant =>
     (...
    (case UTF_Rep_Text (K) is
     when 2#10_000000# .. 2#10_111111# =>
       (case UTF_Rep_Text (K + 1) is
         when 2#1_0000000# .. 2#1_1111111# =>  ...))
     ...);

> Type constraint is put on type values.

(Type values or a type's values?)

AI-05-0146: "invariants apply to all values of a type, while constraints
are generally used to identify a subset of the values of a type".

UTF_8_String does identify a subset of the values of
type String, by intent, even if it takes more of the RM to see that:

Since Strings are dumb insofar as they allow every value of
type Character as a component, a String that is a well-formed
UTF-8 sequence U of octets (each octet appearing as a Character)
is in a subset of type String's values. All are of finite length.
These well-formed sequences U establish a subset of all possible
String values. Call it UTF_8_String, not Unicode_String, nor
UCS_String.

As said, I don't think that the set's predicate is easy to state.
With an aspect stating it, a purported UTF_8_String value that isn't
well-formed will be dropped from the set, perhaps as loudly as
raising Encoding_Error does now.
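
For illustration only, here is a rough sketch of how such a predicate
might be attached to a subtype by delegating to a function. The names
Is_Well_Formed and Checked_UTF_8 are made up for this example and are
not from the RM; the octet ranges are my reading of the Unicode
Standard's table of well-formed UTF-8 byte sequences. It uses the
membership test, which evaluates the predicate without raising.

```ada
with Ada.Text_IO;

procedure UTF_8_Subset_Demo is

   --  Hypothetical helper (not an RM subprogram): True if S is a
   --  well-formed UTF-8 octet sequence.
   function Is_Well_Formed (S : String) return Boolean is
      T : constant String (1 .. S'Length) := S;  --  slide bounds to 1 ..
      I : Positive := 1;
   begin
      while I <= T'Last loop
         declare
            Lead : constant Natural := Character'Pos (T (I));
            Len  : Positive;           --  length of the sequence at I
            Lo   : Natural := 16#80#;  --  allowed range of 2nd octet
            Hi   : Natural := 16#BF#;
         begin
            case Lead is
               when 16#00# .. 16#7F# => Len := 1;
               when 16#C2# .. 16#DF# => Len := 2;
               when 16#E0#           => Len := 3; Lo := 16#A0#;
               when 16#E1# .. 16#EC# | 16#EE# .. 16#EF# => Len := 3;
               when 16#ED#           => Len := 3; Hi := 16#9F#;
               when 16#F0#           => Len := 4; Lo := 16#90#;
               when 16#F1# .. 16#F3# => Len := 4;
               when 16#F4#           => Len := 4; Hi := 16#8F#;
               when others           => return False;
            end case;

            --  The sequence must fit into T.
            if I + Len - 1 > T'Last then
               return False;
            end if;

            --  The second octet has a lead-dependent range.
            if Len >= 2
              and then Character'Pos (T (I + 1)) not in Lo .. Hi
            then
               return False;
            end if;

            --  Any remaining octets must match 2#10xxxxxx#.
            for P in I + 2 .. I + Len - 1 loop
               if Character'Pos (T (P)) not in 16#80# .. 16#BF# then
                  return False;
               end if;
            end loop;

            I := I + Len;
         end;
      end loop;
      return True;
   end Is_Well_Formed;

   --  Hypothetical subtype; the RM's UTF_8_String has no such aspect.
   subtype Checked_UTF_8 is String
     with Dynamic_Predicate => Is_Well_Formed (Checked_UTF_8);

   ASCII_Only       : constant String := "ASCII";
   Latin_1_A_Umlaut : constant String := (1 => Character'Val (16#C4#));
   UTF_8_A_Umlaut   : constant String :=
     (Character'Val (16#C3#), Character'Val (16#84#));
begin
   Ada.Text_IO.Put_Line (Boolean'Image (ASCII_Only in Checked_UTF_8));
   Ada.Text_IO.Put_Line (Boolean'Image (Latin_1_A_Umlaut in Checked_UTF_8));
   Ada.Text_IO.Put_Line (Boolean'Image (UTF_8_A_Umlaut in Checked_UTF_8));
end UTF_8_Subset_Demo;
```

With predicate checks enabled by the assertion policy, converting the
Latin-1 value to Checked_UTF_8 would raise Assertion_Error rather than
Encoding_Error, so "as loudly" is only roughly true of this sketch.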

> Values of UTF-8 strings are not values of strings, as A-umlaut promptly demonstrates. Period.

Of course they can be (ASCII subset). Also, the UTF_String thing is just
a vague expression of what RM A.4.11(47/3) states more specifically,
for going from this representation oriented subtype to "real"
characters from the UCS.
But this is not otherwise reflected in the subtype, AFAICS.
Considering

   UTF_String'("'Ä' is A-Umlaut");

the literal, if taken at face value, doesn't say which characters
there are going to be. It takes an Ada compiler to interpret the
source text and decide whether it is representing Latin-1 or
a multi-octet sequence, possibly one that needs Wide_Character
or Wide_Wide_Character.
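
To sidestep the question of how the compiler reads the literal, a
minimal sketch can build the Latin-1 value from its code position and
go through the conversions in Ada.Strings.UTF_Encoding.Strings. The
procedure and object names below are made up for the example; only
Encode, Decode, UTF_8_String and Encoding_Error come from the RM
packages.

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;

procedure Encoding_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Strings;

   --  'Ä' as a single Latin-1 Character, position 16#C4#.
   Latin_1 : constant String := (1 => Character'Val (16#C4#));

   --  The same character as a UTF-8 octet sequence.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Conv.Encode (Latin_1);
begin
   --  One Latin-1 Character becomes two octets in UTF-8.
   Ada.Text_IO.Put_Line (Integer'Image (Encoded'Length));

   --  Decoding the octets gives back the Latin-1 String.
   Ada.Text_IO.Put_Line (Boolean'Image (Conv.Decode (Encoded) = Latin_1));

   --  Treating the raw Latin-1 octet as UTF-8 is not well-formed,
   --  so Decode is expected to raise Encoding_Error here.
   declare
      Bad : constant String := Conv.Decode (Latin_1);
   begin
      Ada.Text_IO.Put_Line ("decoded" & Integer'Image (Bad'Length));
   end;
exception
   when Ada.Strings.UTF_Encoding.Encoding_Error =>
      Ada.Text_IO.Put_Line ("Encoding_Error");
end Encoding_Demo;
```

Both objects are plain Strings as far as the type system goes, which
is the point: only the conversion functions know the difference.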


> "Remainders are values ... in Character" makes no sense either.

   Character'Val (N rem 256);
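
A small sketch of that expression in context. N is an arbitrary
non-negative value picked for illustration. For N >= 0, N rem 256
lies in 0 .. 255, so Character'Val is defined for it. For a negative
N, rem would yield a negative result and Character'Val would fail
with Constraint_Error; mod would be the operator to use there.

```ada
with Ada.Text_IO;

procedure Remainder_Demo is
   N : constant Natural := 16#1C4#;  --  illustrative value, 452

   --  452 rem 256 = 196 = 16#C4#, the Latin-1 position of 'Ä'.
   C : constant Character := Character'Val (N rem 256);
begin
   Ada.Text_IO.Put_Line (Integer'Image (Character'Pos (C)));
end Remainder_Demo;
```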


-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff


