Bug in Ada - Latin 1 is not a subset of UTF-8

comp.lang.ada

* Bug in Ada - Latin 1 is not a subset of UTF-8
@ 2016-10-17 20:18 Lucretia
  2016-10-17 20:57 ` Jacob Sparre Andersen
  2016-10-17 23:25 ` G.B.
  0 siblings, 2 replies; 30+ messages in thread
From: Lucretia @ 2016-10-17 20:18 UTC (permalink / raw)


Hi,

Whilst binding the SDL_TTF functions, I was going to overload the TTF_Size* functions, but I couldn't do that because UTF_8_String is a subtype of String; String is Latin 1, and Latin 1 is not a subset of UTF-8; ASCII is.

UTF_String should be implemented as an array like String and then UTF_8_String should be a subtype of UTF_String or a renaming, if that is the intent.

Luke.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-17 20:18 Bug in Ada - Latin 1 is not a subset of UTF-8 Lucretia
@ 2016-10-17 20:57 ` Jacob Sparre Andersen
  2016-10-18  5:44   ` J-P. Rosen
  2016-10-17 23:25 ` G.B.
  1 sibling, 1 reply; 30+ messages in thread
From: Jacob Sparre Andersen @ 2016-10-17 20:57 UTC (permalink / raw)


Lucretia wrote:

> Whilst binding SDL_TTF function, I was going to Overload the TTF_Size*
> functions, but I couldn't do that because UTF_8_String is a subtype of
> String; String is Latin 1 and Latin 1 is not a subset of UTF-8, ASCII
> is.
>
> UTF_String should be implemented as an array like String and then
> UTF_8_String should be a subtype of UTF_String or a renaming, if that
> is the intent.

I think the best you can do is to ignore the subtypes declared in
Ada.Strings.UTF_Encoding (as they are just plain wrong), and declare
your own type for storing UTF-8 encoded strings.
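
A minimal sketch of that suggestion, assuming Ada 2012. The names
My_UTF_8_String and Describe are hypothetical and are not part of
SDL_TTF or the standard library. Deriving a new type from String,
rather than declaring a subtype, gives overload resolution two
distinct types to work with:

```ada
with Ada.Text_IO;

procedure Overload_Sketch is

   --  Hypothetical distinct type: same representation as String,
   --  but a different type, so overloading on it is possible.
   type My_UTF_8_String is new String;

   --  Hypothetical stand-ins for two bindings that differ only
   --  in the encoding of their text argument.
   function Describe (Text : String) return String is
     ("Latin-1 text with" & Integer'Image (Text'Length) & " characters");

   function Describe (Text : My_UTF_8_String) return String is
     ("UTF-8 text with" & Integer'Image (Text'Length) & " octets");

   Latin : constant String          := "abc";
   UTF   : constant My_UTF_8_String := "abc";

begin
   --  The object types select the overload; a bare literal would
   --  be ambiguous here.
   Ada.Text_IO.Put_Line (Describe (Latin));
   Ada.Text_IO.Put_Line (Describe (UTF));
end Overload_Sketch;
```

With the standard subtype Ada.Strings.UTF_Encoding.UTF_8_String in
place of My_UTF_8_String, the second Describe would have the same
profile as the first and the pair would be rejected. The price of the
derived type is an explicit conversion wherever a plain String is
needed.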

Greetings,

Jacob
-- 
"There are only two types of data:
                         Data which has been backed up
                         Data which has not been lost - yet"


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-17 20:18 Bug in Ada - Latin 1 is not a subset of UTF-8 Lucretia
  2016-10-17 20:57 ` Jacob Sparre Andersen
@ 2016-10-17 23:25 ` G.B.
  2016-10-18  7:41   ` Dmitry A. Kazakov
  1 sibling, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-17 23:25 UTC (permalink / raw)


On 17.10.16 22:18, Lucretia wrote:
> Hi,
>
> Whilst binding SDL_TTF function, I was going to Overload the TTF_Size* functions, but I couldn't do that because UTF_8_String is a subtype of String; String is Latin 1 and Latin 1 is not a subset of UTF-8, ASCII is.
>
> UTF_String should be implemented as an array like String and then UTF_8_String should be a subtype of UTF_String or a renaming, if that is the intent.
>

According to ISO 10646, UTF stands for UCS Transformation
Format. So, it's a format, suggesting a representation.

On similar grounds, one could define a string subtype for
other types of objects, for example

   subtype Number_String is String;

The components represent the bits of the octets of the numbers
(base 256) in sequence, of whole numbers assumed to be
phone numbers. Each phone number is headed by a plus sign.

So, calling a taxi by telephone in Berlin, Dublin, or Ho Chi Minh
City might be helped by turning the string

   "+^@^A%??+^@^H9^K-+^@^S??^C"

into the respective numbers.

The intent, I guess, of UTF_String and its kin is to facilitate
reading and writing items of UCS.


-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff



* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-17 20:57 ` Jacob Sparre Andersen
@ 2016-10-18  5:44   ` J-P. Rosen
  0 siblings, 0 replies; 30+ messages in thread
From: J-P. Rosen @ 2016-10-18  5:44 UTC (permalink / raw)


On 17/10/2016 at 22:57, Jacob Sparre Andersen wrote:
>> UTF_String should be implemented as an array like String and then
>> UTF_8_String should be a subtype of UTF_String or a renaming, if that
>> is the intent.
> I think the best you can do is to ignore the subtypes declared in
> Ada.Strings.UTF_Encoding (as they are just plain wrong), and declare
> your own type for storing UTF-8 encoded strings.

FWIW, the issue of whether to make UTF-8 a different type or a subtype
of String was discussed at the ARG. It was decided to make a subtype
basically on the grounds that:
1) In most cases, you need to read the beginning of a file (presumably
with Text_IO) before you decide whether it is UTF-8 or not
2) We feared that with a separate type, people would complain that "once
again, Ada does it differently than other languages", and that it would
involve many type conversions for no real benefit.
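
Point 1 can be sketched as follows, assuming Ada 2012. The helper
Starts_With_BOM and the sample strings are hypothetical; a real program
would obtain the first line from Ada.Text_IO.Get_Line. Because
UTF_8_String is a subtype of String, the standard constant BOM_8 can be
compared with a slice of an ordinary String without any conversion:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;

procedure BOM_Sketch is

   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper: True when Line begins with the UTF-8
   --  byte order mark (BOM_8 is declared in Ada.Strings.UTF_Encoding).
   function Starts_With_BOM (Line : String) return Boolean is
     (Line'Length >= BOM_8'Length
      and then Line (Line'First .. Line'First + BOM_8'Length - 1) = BOM_8);

   --  Stand-ins for the first line read from two different files.
   Marked   : constant String := BOM_8 & "text";
   Unmarked : constant String := "text";

begin
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM (Marked)));
   Ada.Text_IO.Put_Line (Boolean'Image (Starts_With_BOM (Unmarked)));
end BOM_Sketch;
```

A missing byte order mark does not prove that a file is not UTF-8, so
this check alone is only a first hint.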

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-17 23:25 ` G.B.
@ 2016-10-18  7:41   ` Dmitry A. Kazakov
  2016-10-18  8:23     ` G.B.
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-18  7:41 UTC (permalink / raw)

On 18/10/2016 01:25, G.B. wrote:
> On 17.10.16 22:18, Lucretia wrote:

> According to ISO 10646, UTF stands for UCS Transformation
> Format. So, it's a format, suggesting a representation.
>
> On similar grounds, one could define a string subtype for
> other types of objects, for example
>
>   subtype Number_String is String;

You are wrong. A string of numeric characters is not an encoding; it is
a constraint, i.e. (by definition) each instance of a numeric string is
a string. [An example of an encoding (= representation) is IEEE 754 vs
IBM 360 float.]

A UTF-8 string is not a constrained string and, conversely, a string is
not a constrained UTF-8 string.

These are two distinct types whose values (some of them) overlap and can
be converted into each other. The latter allows making them subtypes,
but the Ada language lacks the means for that.

In Ada a subtype can either be a constraint (AKA "Ada subtype") or a
class member / class-wide. UTF-8 is not a constraint and String is not
tagged.

The decision to force UTF-8 string and string [Latin-1 string, to be
precise] to be subtypes in the Ada sense is the lesser of two evils. It
is bad and wrong, but the alternative would only be worse.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18  7:41   ` Dmitry A. Kazakov
@ 2016-10-18  8:23     ` G.B.
  2016-10-18  8:45       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-18  8:23 UTC (permalink / raw)


On 18.10.16 09:41, Dmitry A. Kazakov wrote:
> On 18/10/2016 01:25, G.B. wrote:
>> On 17.10.16 22:18, Lucretia wrote:
>
>> According to ISO 10646, UTF stands for UCS Transformation
>> Format. So, it's a format, suggesting a representation.
>>
>> On similar grounds, one could define a string subtype for
>> other types of objects, for example
>>
>>   subtype Number_String is String;
>
> You are wrong.

The constraints on either UTF_String or Number_String are
not expressible as simple Ada subtypes. They are given by
description and normative reference, respectively.


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18  8:23     ` G.B.
@ 2016-10-18  8:45       ` Dmitry A. Kazakov
  2016-10-18 10:09         ` G.B.
                           ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-18  8:45 UTC (permalink / raw)


On 18/10/2016 10:23, G.B. wrote:
> On 18.10.16 09:41, Dmitry A. Kazakov wrote:
>> On 18/10/2016 01:25, G.B. wrote:
>>> On 17.10.16 22:18, Lucretia wrote:
>>
>>> According to ISO 10646, UTF stands for UCS Transformation
>>> Format. So, it's a format, suggesting a representation.
>>>
>>> On similar grounds, one could define a string subtype for
>>> other types of objects, for example
>>>
>>>   subtype Number_String is String;
>>
>> You are wrong.
>
> The constraints on either UTF_String or Number_String are
> not expressible as simple Ada subtypes. They are given by
> description and normative reference, respectively.

In the case of UTF-8 it is not a constraint. "Ä" has different 
representations as Latin-1 and UTF-8 strings.
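
A small sketch of that difference, assuming Ada 2012 and the standard
package Ada.Strings.UTF_Encoding.Strings; the procedure and object names
are made up for illustration. It encodes the single Latin-1 character
'Ä' and prints the component count and code positions of both forms:

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure A_Umlaut_Sketch is

   --  One Latin-1 character: code position 196 (16#C4#).
   Latin : constant String :=
     (1 => Ada.Characters.Latin_1.UC_A_Diaeresis);

   --  The same character after encoding: octets 195 and 132
   --  (16#C3#, 16#84#).
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin);

begin
   Ada.Text_IO.Put_Line
     ("Latin-1 components:" & Integer'Image (Latin'Length));
   Ada.Text_IO.Put_Line
     ("UTF-8 components:  " & Integer'Image (Encoded'Length));

   --  List the code position of every component of the encoded form.
   for C of Encoded loop
      Ada.Text_IO.Put_Line (Integer'Image (Character'Pos (C)));
   end loop;
end A_Umlaut_Sketch;
```

Both objects have type String as far as the compiler is concerned, yet
the first holds one component and the second holds two.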

Numeric character is a constraint expressible in Ada:

    subtype Numeric is Character range '0'..'9';

Numeric string constraint is not expressible, but it is still a constraint.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18  8:45       ` Dmitry A. Kazakov
@ 2016-10-18 10:09         ` G.B.
  2016-10-18 12:24           ` Dmitry A. Kazakov
  2016-10-20  0:31         ` Randy Brukardt
  2016-10-28 21:08         ` Shark8
  2 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-18 10:09 UTC (permalink / raw)

On 18.10.16 10:45, Dmitry A. Kazakov wrote:
> On 18/10/2016 10:23, G.B. wrote:
>> On 18.10.16 09:41, Dmitry A. Kazakov wrote:
>>> On 18/10/2016 01:25, G.B. wrote:
>>>> On 17.10.16 22:18, Lucretia wrote:
>>>
>>>> According to ISO 10646, UTF stands for UCS Transformation
>>>> Format. So, it's a format, suggesting a representation.
>>>>
>>>> On similar grounds, one could define a string subtype for
>>>> other types of objects, for example
>>>>
>>>>   subtype Number_String is String;
>>>
>>> You are wrong.
>>
>> The constraints on either UTF_String or Number_String are
>> not expressible as simple Ada subtypes. They are given by
>> description and normative reference, respectively.
>
> In the case of UTF-8 it is not a constraint.

Not an Ada constraint, in particular insofar as UTF-8 means
a representation;
still, any UTF-8 encoded "string" of UCS objects is well-formed
and it satisfies a predicate that involves all components x, x', x'', ...
of a UTF_8_String object, by stating that if x matches 2#10......#,
then x' is such-and-such, and so on. I'm not sure this predicate
is easily stated as a stand-alone type invariant, for example, but
that's the idea. It shouldn't have to be visible to Ada programmers.

>
> Numeric character is a constraint expressible in Ada:
>
>    subtype Numeric is Character range '0'..'9';
>
> Numeric string constraint is not expressible, but it is still a constraint.

(Although, the Numeric_String subtype described earlier will have
a meaningless constraint on Numeric, since all remainders
are values both in base 256 and in Character. Come to think of it,
the example format is broken. #-)

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 10:09         ` G.B.
@ 2016-10-18 12:24           ` Dmitry A. Kazakov
  2016-10-18 15:10             ` G.B.
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-18 12:24 UTC (permalink / raw)


On 18/10/2016 12:09, G.B. wrote:
> On 18.10.16 10:45, Dmitry A. Kazakov wrote:
>> On 18/10/2016 10:23, G.B. wrote:
>>> On 18.10.16 09:41, Dmitry A. Kazakov wrote:
>>>> On 18/10/2016 01:25, G.B. wrote:
>>>>> On 17.10.16 22:18, Lucretia wrote:
>>>>
>>>>> According to ISO 10646, UTF stands for UCS Transformation
>>>>> Format. So, it's a format, suggesting a representation.
>>>>>
>>>>> On similar grounds, one could define a string subtype for
>>>>> other types of objects, for example
>>>>>
>>>>>   subtype Number_String is String;
>>>>
>>>> You are wrong.
>>>
>>> The constraints on either UTF_String or Number_String are
>>> not expressible as simple Ada subtypes. They are given by
>>> description and normative reference, respectively.
>>
>> In the case of UTF-8 it is not a constraint.
>
> Not an Ada constraint, in particular insofar as UTF-8 means
> a representation;
> still, any UTF-8 encoded "string" of UCS objects is wellformed
> and it satisfies a predicate that involves all components x, x', x'', ...
> of a UTF_8_String object, by stating that if x matches 2#10......#,
> then x' is such-and-such, and so on. I'm not sure this predicate
> is easily stated as a stand-alone type invariant, for example, but
> that's the idea. It shouldn't have to be visible to Ada programmers.

Sorry, that is a meaningless set of words. Type constraint is put on 
type values.

Values of UTF-8 strings are not values of strings, as A-umlaut promptly 
demonstrates. Period.

>> Numeric character is a constraint expressible in Ada:
>>
>>    subtype Numeric is Character range '0'..'9';
>>
>> Numeric string constraint is not expressible, but it is still a constraint.
>
> (Although, the Numeric_String subtype described earlier will have
> a meaningless constraint on Numeric, since all remainders
> are values both in base 256 and in Character. Come to think of it,
> the example format is broken. #-)

"Remainders are values ... in Character" makes no sense either.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 12:24           ` Dmitry A. Kazakov
@ 2016-10-18 15:10             ` G.B.
  2016-10-18 16:35               ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-18 15:10 UTC (permalink / raw)

On 18.10.16 14:24, Dmitry A. Kazakov wrote:

>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>> and it satisfies a predicate that involves all components x, x', x'', ...
>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>> then x' is such-and-such, and so on. I'm not sure this predicate
>> is easily stated as a stand-alone type invariant, for example, but
>> that's the idea. It shouldn't have to be visible to Ada programmers.
>
> Sorry, that is a meaningless set of words.

Spelling out the look of model strings for type UTF_8_String
cannot quite be meaningless.

Ada Rationale:

"Type invariants are designed for use with private types
  where we want some relationship to always hold between
  components of the type". (2.4)

We want some relationship to always hold between components
of type UTF_8_String (if it were private, so that 2.4 might
formally apply):

   type UTF_Rep_Text is ... with Type_Invariant =>
     (...
    (case UTF_Rep_Text (K) is
     when 2#10_000000# .. 2#10_111111# =>
       (case UTF_Rep_Text (K + 1) is
         when 2#1_0000000# .. 2#1_1111111# =>  ...))
     ...);

> Type constraint is put on type values.

(Type values or a type's values?)

AI-05-0146: "invariants apply to all values of a type, while constraints
are generally used to identify a subset of the values of a type".

UTF_8_String does identify a subset of the values of
type String, by intent, even if it takes more of the RM to see that:

Since Strings are dumb insofar as they allow every value of
type Character as a component, a string that is a well-formed
UTF-8 sequence U of octets---each octet appears as a Character---is in
a subset of type String's values. All are of finite length. These
well-formed sequences U establish a subset of all possible String values.
Call it UTF_8_String, not Unicode_String, nor UCS_String.

As said, I don't think that the set's predicate is easy to state.
With an aspect stating it, a purported UTF_8_String value that isn't
one will be dropped from the set, perhaps as loudly as Encoding_Error
is raised now.
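
A rough sketch of such an aspect, assuming Ada 2012. It uses a
Dynamic_Predicate on a subtype rather than the Type_Invariant shown
above, and the names Is_Well_Formed and Checked_UTF_8 are hypothetical.
The check is structural only: it verifies that each lead octet is
followed by the right number of continuation octets, and it does not
reject overlong forms or surrogates:

```ada
with Ada.Text_IO;

procedure Predicate_Sketch is

   --  Structural check only (an assumption of this sketch): lead
   --  octets must be followed by the expected number of 2#10xxxxxx#
   --  continuation octets.
   function Is_Well_Formed (S : String) return Boolean is
      I    : Integer := S'First;
      Code : Natural;
      Need : Natural;
   begin
      while I <= S'Last loop
         Code := Character'Pos (S (I));
         case Code is
            when 16#00# .. 16#7F# => Need := 0;
            when 16#C0# .. 16#DF# => Need := 1;
            when 16#E0# .. 16#EF# => Need := 2;
            when 16#F0# .. 16#F7# => Need := 3;
            when others           => return False;
         end case;

         --  Not enough octets left for the continuation part.
         if Need > S'Last - I then
            return False;
         end if;

         for K in I + 1 .. I + Need loop
            if Character'Pos (S (K)) not in 16#80# .. 16#BF# then
               return False;
            end if;
         end loop;

         I := I + Need + 1;
      end loop;
      return True;
   end Is_Well_Formed;

   subtype Checked_UTF_8 is String
     with Dynamic_Predicate => Is_Well_Formed (Checked_UTF_8);

   --  'Ä' encoded as UTF-8, and the lone Latin-1 octet for 'Ä'.
   Good : constant String :=
     (Character'Val (16#C3#), Character'Val (16#84#));
   Bad  : constant String := (1 => Character'Val (16#C4#));

begin
   --  Membership tests evaluate the predicate without raising.
   Ada.Text_IO.Put_Line (Boolean'Image (Good in Checked_UTF_8));
   Ada.Text_IO.Put_Line (Boolean'Image (Bad in Checked_UTF_8));
end Predicate_Sketch;
```

The predicate selects a subset of String values; it says nothing about
which characters those values are meant to represent.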

> Values of UTF-8 strings are not values of strings, as A-umlaut promptly demonstrates. Period.

Of course they can be (ASCII subset). Also, the UTF_String thing is just
a vague expression of what RM A.4.11(47/3) states more specifically,
for going from this representation-oriented subtype to "real"
characters from the UCS.
But this is not otherwise reflected in the subtype, AFAICS.
Considering

   UTF_String'("'Ä' is A-Umlaut");

the literal, if taken at face value, doesn't say which characters
there are going to be. It takes an Ada compiler to interpret the
source text and decide whether it is representing Latin-1 or
a multi-octet sequence, possibly one that needs Wide_Character
or Wide_Wide_Character.

> "Remainders are values ... in Character" makes no sense either.

   Character'Val (N rem 256);

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 15:10             ` G.B.
@ 2016-10-18 16:35               ` Dmitry A. Kazakov
  2016-10-18 17:35                 ` G.B.
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-18 16:35 UTC (permalink / raw)

On 2016-10-18 17:10, G.B. wrote:
> On 18.10.16 14:24, Dmitry A. Kazakov wrote:
>
>>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>>> and it satisfies a predicate that involves all components x, x', x'',
>>> ...
>>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>>> then x' is such-and-such, and so on. I'm not sure this predicate
>>> is easily stated as a stand-alone type invariant, for example, but
>>> that's the idea. It shouldn't have to be visible to Ada programmers.
>>
>> Sorry, that is a meaningless set of words.
>
> Spelling out the look of model strings for type UTF_8_String
> cannot quite be meaningless.
>
> Ada Rationale:
>
> "Type invariants are designed for use with private types
>  where we want some relationship to always hold between
>  components of the type". (2.4)

That is completely irrelevant. No invariant can make Latin-1 A-umlaut 
UTF-8 A-umlaut.

>> Type constraint is put on type values.
>
> (Type values or a type's values?)

Values of a type, e.g. Positive is a constrained Integer.

> UTF_8_String does identify a subset of the values of
> type String, by intent,

No, it does not, that is why this implementation is broken. UTF-8
strings can be represented by String, they can be represented by Boolean
arrays or by indefinite integers or by polygons. That does not make
them a Boolean array subtype. No way.

>> Values of UTF-8 strings are not values of strings, as A-umlaut
>> promptly demonstrates. Period.
>
> Of course they can be (ASCII subset).

A-umlaut is not ASCII.

> But this is not otherwise reflected in the subtype, AFAICS.
> Considering
>
>   UTF_String'("'Ä' is A-Umlaut");
>
> the literal, if taken at face value. doesn't say which characters
> there are going to be.

It does exactly this, once you define "character".

> It takes an Ada compiler to interpret the
> source text and decide whether it is representing Latin-1 or
> a multi-octed sequence, possibly one that needs Wide_Character
> or Wide_Wide_Character.

There is nothing to interpret considering literals of Universal_String.
It is no different from the way Universal_Integer is handled. String,
UTF-8 string and Wide string can be considered subtypes of
Universal_String; that has no effect on the relationship between
String and UTF-8 string. The same holds if literals are considered
overloaded functions. No different.

>> "Remainders are values ... in Character" makes no sense either.
>
>   Character'Val (N rem 256);

So what? Numeric character is still a constrained subtype of the
character type.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 16:35               ` Dmitry A. Kazakov
@ 2016-10-18 17:35                 ` G.B.
  2016-10-18 20:03                   ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-18 17:35 UTC (permalink / raw)


On 18.10.16 18:35, Dmitry A. Kazakov wrote:
> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.

Who would ever want to do that?

Before I/O, there is nothing. UTF_8_String is for encoding
and decoding subprograms of Ada. For them to be successful,
a predicate could be used to express the set of values that
can be parsed. It so happens that its members are officially
said to be in encoded form.

To get a subset U from a set S, you apply a constraint
to S. That's not (easily) expressible in Ada in this case.
But if it is, with the help of a predicate, then we can
say that UTF_8_String is-a "constrained" String because
their sets are.


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 17:35                 ` G.B.
@ 2016-10-18 20:03                   ` Dmitry A. Kazakov
  2016-10-19  8:15                     ` G.B.
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-18 20:03 UTC (permalink / raw)

On 2016-10-18 19:35, G.B. wrote:
> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>
> Who would ever want to do that?

Somebody claiming that UTF-8 string is a constrained subtype of Latin-1 
string.

> To get a subset U from a set S, you apply a constraint
> to S. That's not (easily) expressible in Ada in this case.

There is no such constraint at all. A-umlaut in Latin-1 is one 
character, in UTF-8 it is two characters.

To introduce a subtype relationship we need a conversion, not a 
constraint. Ada does not support this method of subtype construction.

> But if it is, with the help of a predicate, then we can
> say that UTF_8_String is-a "constrained" String because
> their sets are.

They are not, as demonstrated on the example of A-umlaut.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18 20:03                   ` Dmitry A. Kazakov
@ 2016-10-19  8:15                     ` G.B.
  2016-10-19  8:25                       ` G.B.
  2016-10-19  8:49                       ` Dmitry A. Kazakov
  0 siblings, 2 replies; 30+ messages in thread
From: G.B. @ 2016-10-19  8:15 UTC (permalink / raw)

On 18.10.16 22:03, Dmitry A. Kazakov wrote:
> On 2016-10-18 19:35, G.B. wrote:
>> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>>
>> Who would ever want to do that?
>
> Somebody claiming that UTF-8 string is a constrained subtype of Latin-1 string.

But I do not claim this!

The misconception is to think that String is meant to be
Latin-1 String. String isn't Latin-1 String. Ada states
a *correspondence*, but no essence at all.

In fact, reading Japanese, or Polish, or Hebrew text would
be impossible to do in Ada if String was Latin-1!

Yes, character sets in Ada do not have types.

>> To get a subset U from a set S, you apply a constraint
>> to S. That's not (easily) expressible in Ada in this case.
>
> There is no such constraint at all. A-umlaut in Latin-1 is one character, in UTF-8 it is two characters.

In Ada, A-Umlaut is not a character in Latin-1.
In Ada, A-Umlaut is not a character in UTF-8.

Reason: Latin-1 and UTF-8 describe encoded forms, as do
KOI8-R, ISO-8859-15, Shift_JIS, or CP 1252. Some only
happen to list, and some only indicate a repertoire of
corresponding characters also.

A-Umlaut is a character, lower case C.

> To introduce a subtype relationship we need a conversion, not a constraint. Ada does not support this method of subtype construction.

An Ada-subtype relationship is designed to avoid conversion,
and so it is distinguishable by its constraint, and its name,
only.

Where we would be needing conversion, were Ada to have
types for character sets and so on, we now have operations
such as Encode, Decode, and Convert, together with statements
of correspondence and normative reference in the RM.
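
A minimal round-trip sketch of those operations, assuming Ada 2012 and
the standard package Ada.Strings.UTF_Encoding.Strings; the procedure and
object names are invented for illustration:

```ada
with Ada.Text_IO;
with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure Round_Trip_Sketch is

   --  Four Latin-1 characters, one of them outside ASCII.
   Original : constant String :=
     "K" & Ada.Characters.Latin_1.LC_A_Diaeresis & "se";

   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Original);

   Decoded : constant String :=
     Ada.Strings.UTF_Encoding.Strings.Decode (Encoded);

begin
   --  Decoding the encoded form restores the Latin-1 value.
   Ada.Text_IO.Put_Line (Boolean'Image (Decoded = Original));

   --  This comparison compiles because both operands have type
   --  String, yet the two objects hold different component sequences.
   Ada.Text_IO.Put_Line (Boolean'Image (Encoded = Original));
end Round_Trip_Sketch;
```

The second comparison is accepted by the compiler only because
UTF_8_String is a subtype of String; nothing in the types records which
of the two objects is in encoded form.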

But neither prevents identifying a subset of valid values
of dumb type String that constitutes UTF_8_String,
or that of a to-be-defined (trivial) subtype Latin_1_String.

    type Latin_String is String;
    --   RM blah blah ...

    type Latin_1_String is String;

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-19  8:15                     ` G.B.
@ 2016-10-19  8:25                       ` G.B.
  2016-10-19  8:49                       ` Dmitry A. Kazakov
  1 sibling, 0 replies; 30+ messages in thread
From: G.B. @ 2016-10-19  8:25 UTC (permalink / raw)


On 19.10.16 10:15, G.B. wrote:
>    type Latin_String is String;
>    --   RM blah blah ...
>
>    type Latin_1_String is String;
subtype ...

Sorry.


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-19  8:15                     ` G.B.
  2016-10-19  8:25                       ` G.B.
@ 2016-10-19  8:49                       ` Dmitry A. Kazakov
  2016-10-19 14:20                         ` G.B.
  1 sibling, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-19  8:49 UTC (permalink / raw)

On 19/10/2016 10:15, G.B. wrote:
> On 18.10.16 22:03, Dmitry A. Kazakov wrote:
>> On 2016-10-18 19:35, G.B. wrote:
>>> On 18.10.16 18:35, Dmitry A. Kazakov wrote:
>>>> No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.
>>>
>>> Who would ever want to do that?
>>
>> Somebody claiming that UTF-8 string is a constrained subtype of
>> Latin-1 string.
>
> But I do not claim this!
>
> The misconception is to think that String is meant to be
> Latin-1 String. String isn't Latin-1 String. Ada states
> a *correspondence*, but no essence at all.

3.5.2

"The predefined type Character is a character type whose values 
correspond to the 256 code positions of Row 00 (also known as Latin-1) 
of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

String means Latin-1. You can use it as if it meant something else, e.g. 
UTF-8 string or UCS-2 string or PDP-11 machine code. That would prove 
nothing except your willingness to go untyped.

> In fact, reading Japanese, or Polish, or Hebrew text would
> be impossible to do in Ada if String was Latin-1!

Polish alphabet is Latin based, BTW.

Yes, you need to break the type system in order to re-interpret String
as a UTF-8 string. You cannot do it in a typed way, that is the whole
point. Latin-1 and UTF-8 strings are not subtypes unless you break
types. Once you have done that, it does not make any sense to talk about
subtypes anymore. A subtype presumes keeping, if not all (LSP subtype),
then at least some of the vital properties. Latin-1 strings
re-interpreted as UTF-8 strings keep almost none of the string
properties.

>>> To get a subset U from a set S, you apply a constraint
>>> to S. That's not (easily) expressible in Ada in this case.
>>
>> There is no such constraint at all. A-umlaut in Latin-1 is one
>> character, in UTF-8 it is two characters.
>
> In Ada, A-Umlaut is not a character in Latin-1,

It is. ISO/IEC 8859-1

> Where we would be needing conversion, were Ada to have
> types for character sets and so on, we now have operations
> such as Encode, Decode, and Convert.

Yep, Ada becomes an untyped mess. Again, it is not ill will to make C
out of Ada; it is merely a deficiency of the Ada type system that
prevents doing it properly. We cannot do it with generics or constrained
subtypes, so we drop typing to have at least something.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-19  8:49                       ` Dmitry A. Kazakov
@ 2016-10-19 14:20                         ` G.B.
  2016-10-19 16:20                           ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-19 14:20 UTC (permalink / raw)

On 19.10.16 10:49, Dmitry A. Kazakov wrote:

>> The misconception is to think that String is meant to be
>> Latin-1 String. String isn't Latin-1 String. Ada states
>> a *correspondence*, but no essence at all.
>
> 3.5.2
>
> "The predefined type Character is a character type whose values
> correspond to the 256 code positions of Row 00 (also known as Latin-1)
   ^^^^^^^^^^
> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."

Exactly: it means the values aren't Latin-1; they correspond
to Latin-1 code points. (To be /= to correspond to.)

>>>> To get a subset U from a set S, you apply a constraint
>>>> to S. That's not (easily) expressible in Ada in this case.
>>>
>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>> character, in UTF-8 it is two characters.

A-Umlaut is a character, not a character-in-Some-Encoding-Form.
'€' is one, too, as are the four in "Łódź" that the man named
"Artiñano" (8 characters) could not manage to type into his letter
without accidentally spoiling his last name.

>> In Ada, A-Umlaut is not a character in Latin-1,
>
> It is. ISO/IEC 8859-1

For Ada, A-Umlaut is ("essence" vs "correspondence") not
a character in ISO/IEC 8859-1, but there exist correspondences
between A-Umlaut and the Ada Character and ISO/IEC 8859-1.
And we "cannot do it in a typed way, that is the whole point".

> it is merely a deficiency of Ada type system to do it properly.
> We cannot do it with generics or constrained subtypes, so we drop typing
> to have at least something.

Ada can add a constraining aspect to a type derived from String
so as to formally specify the set of values in that type.
In a way similar to

   type US_Elevator is new Integer range -10 .. 500
      with
        Static_Predicate => US_Elevator /= 13;

The short, informal name of that computable, exact specification
by a Predicate for the former type derived from String is "UTF-8".

It gives one-way substitutability: you can use a value of the
derived type wherever you can use a value of type String, if
there ever is a need for doing so (e.g. dumb String'Write can be
reused after Convert-ing to UTF_8_String (encoding)).


* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-19 14:20                         ` G.B.
@ 2016-10-19 16:20                           ` Dmitry A. Kazakov
  0 siblings, 0 replies; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-19 16:20 UTC (permalink / raw)

On 2016-10-19 16:20, G.B. wrote:
> On 19.10.16 10:49, Dmitry A. Kazakov wrote:
>
>>> The misconception is to think that String is meant to be
>>> Latin-1 String. String isn't Latin-1 String. Ada states
>>> a *correspondence*, but no essence at all.
>>
>> 3.5.2
>>
>> "The predefined type Character is a character type whose values
>> correspond to the 256 code positions of Row 00 (also known as Latin-1)
>   ^^^^^^^^^^
>> of the ISO/IEC 10646:2003 Basic Multilingual Plane (BMP)."
>
> Exactly, it means, values aren't Latin-1, they correspond
> to Latin-1 code points. (To be /= To correspond to.)

They are. The language is necessarily sloppy for the sake of simplicity.
Values are Latin-1. The corresponding language character objects (which
are customarily called "values" too) correspond to, i.e. represent, these
values. There is no reason to distinguish language values and the problem
space values they represent so long as there is no confusion.

Anyway, it does not change anything in the discussion. The same objects
of String and UTF-8 String correspond to / represent different problem
space values. Sameness is defined as equality "=".

>>>>> To get a subset U from a set S, you apply a constraint
>>>>> to S. That's not (easily) expressible in Ada in this case.
>>>>
>>>> There is no such constraint at all. A-umlaut in Latin-1 is one
>>>> character, in UTF-8 it is two characters.
>
> A-Umlaut is a character, not a character-in-Some-Encoding-Form.

The text you quote states exactly that.

>>> In Ada, A-Umlaut is not a character in Latin-1,
>>
>> It is. ISO/IEC 8859-1
>
> For Ada, A-Umlaut is ("essence" vs "correspondence") not
> a character in ISO/IEC 8859-1, but there exist correspondences
> between A-Umlaut and the Ada Character and ISO/IEC 8859-1.

Ada character objects represent characters defined in ISO/IEC 8859-1.
For each object there is one and only one ISO/IEC 8859-1 character and
conversely for each ISO/IEC 8859-1 character there is one and only one Ada
character value.

> And we "cannot do it in a typed way, that is the whole point".
>
>> it is merely a deficiency of Ada type system to do it properly.
>> We cannot do it with generics or constrained subtypes, so we drop typing
>> to have at least something.
>
> Ada can add a constraining aspect to a type derived from String
> so as to formally specify the set of values in that type.

That won't be a string subtype, a property considered more important 
than being a proper subtype.

There is no language subtype that could represent a subtype relationship 
between sequences of *same* characters having *different* encoding 
(representation). Which is the essence of the problem.
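
To make the A-umlaut point concrete, here is a minimal sketch. It is an illustration only, not code from the thread, and the procedure name A_Umlaut_Demo is made up. It assumes an Ada 2012 compiler and uses the standard package Ada.Strings.UTF_Encoding.Strings to encode the one-character Latin-1 string, then compares the two octet sequences:

```ada
pragma Assertion_Policy (Check);

with Ada.Characters.Latin_1;
with Ada.Strings.UTF_Encoding.Strings;

procedure A_Umlaut_Demo is
   --  One Latin-1 character, stored as the single octet 16#C4#.
   Latin : constant String :=
     (1 => Ada.Characters.Latin_1.UC_A_Diaeresis);

   --  The same character encoded as UTF-8.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Latin);
begin
   pragma Assert (Latin'Length = 1);
   pragma Assert (Encoded'Length = 2);
   pragma Assert (Character'Pos (Encoded (Encoded'First)) = 16#C3#);
   pragma Assert (Character'Pos (Encoded (Encoded'Last)) = 16#84#);

   --  Both objects have type String, yet "=" says they differ.
   pragma Assert (Latin /= Encoded);
end A_Umlaut_Demo;
```

Since UTF_8_String is a subtype of String, nothing stops either object from being passed where the other is expected; the compiler cannot tell the two representations apart.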

> The short, informal name of that computable, exact specification
> by a Predicate for the former type derived from String is "UTF-8".
>
> It gives one-way substitutability: you can use a value of the
> derived type wherever you can use a value of type String, if
> there ever is a need for doing so (e.g. dumb String'Write can be
> reused after Convert-ing to UTF_8_String (encoding)).

See what the example with A-umlaut illustrates.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18  8:45       ` Dmitry A. Kazakov
  2016-10-18 10:09         ` G.B.
@ 2016-10-20  0:31         ` Randy Brukardt
  2016-10-20  7:36           ` Dmitry A. Kazakov
  2016-10-28 21:08         ` Shark8
  2 siblings, 1 reply; 30+ messages in thread
From: Randy Brukardt @ 2016-10-20  0:31 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:nu4nee$18le$1@gioia.aioe.org...
...
> Numeric character is a constraint expressible in Ada:
>
>    subtype Numeric is Character range '0'..'9';
>
> Numeric string constraint is not expressible, but it is still a constraint.

It's expressible as a predicate, though; that's the entire point of 
predicates (to act like user-defined constraints):

    subtype Numeric_String is String
        with Dynamic_Predicate =>
           (for all E of Numeric_String => E in Numeric);

It's not 100% as good as a constraint (as modifications of individual 
components won't be checked), but it almost always will do the job.

You also could declare a new type with the proper constraint:
    type Numeric_String is array (Positive range <>) of Numeric;

That will have all of the string operations, but it (unfortunately) can't be 
converted to String (you'd have to write a function to do that).
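
A minimal sketch of such a conversion function follows. The names To_String, Year and Numeric_Array_Demo are made up for illustration; an Ada 2012 compiler is assumed.

```ada
pragma Assertion_Policy (Check);

procedure Numeric_Array_Demo is
   subtype Numeric is Character range '0' .. '9';
   type Numeric_String is array (Positive range <>) of Numeric;

   --  Component-wise copy. A direct type conversion String (Item) is
   --  not allowed, because the component subtypes (Numeric versus
   --  Character) do not statically match.
   function To_String (Item : Numeric_String) return String is
      Result : String (Item'Range);
   begin
      for I in Item'Range loop
         Result (I) := Item (I);
      end loop;
      return Result;
   end To_String;

   Year : constant Numeric_String := "2016";
begin
   --  String literals and "&" are available for the new type.
   pragma Assert (Year & "10" = "201610");
   pragma Assert (To_String (Year) = "2016");
end Numeric_Array_Demo;
```

With this type every component is range-checked on assignment, which the predicate version does not give you.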

Since both of these possibilities exist, I'd hardly call the constraint "not 
expressible". At worst, it's inconvenient to express it.

                           Randy.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-20  0:31         ` Randy Brukardt
@ 2016-10-20  7:36           ` Dmitry A. Kazakov
  2016-10-21 12:28             ` G.B.
  2016-10-22  1:53             ` Randy Brukardt
  0 siblings, 2 replies; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-20  7:36 UTC (permalink / raw)


On 20/10/2016 02:31, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:nu4nee$18le$1@gioia.aioe.org...
> ...
>> Numeric character is a constraint expressible in Ada:
>>
>>    subtype Numeric is Character range '0'..'9';
>>
>> Numeric string constraint is not expressible, but it is still a constraint.
>
> It's expressible as a predicate, though; that's the entire point of
> predicates (to act like user-defined constraints):
>
>     subtype Numeric_String is String
>         with Dynamic_Predicate => (for all E of Numeric_String => E in
> Numeric);
>
> It's not 100% as good as a constraint (as modifications of individual
> components won't be checked), but it almost always will do the job.

Not nice. Is there a reason why, apart from premature optimization?

> You also could declare a new type with the proper constraint:
>     type Numeric_String is array (Positive range <>) of Numeric;
>
> That will have all of the string operations, but it (unfortunately) can't be
> converted to String (you'd have to write a function to do that).
>
> Since both of these possibilities exist, I'd hardly call the constraint "not
> expressible". At worst, it's inconvenient to express it.

Yes, maybe.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-20  7:36           ` Dmitry A. Kazakov
@ 2016-10-21 12:28             ` G.B.
  2016-10-21 16:13               ` Lucretia
  2016-10-22  1:53             ` Randy Brukardt
  1 sibling, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-21 12:28 UTC (permalink / raw)


On 20.10.16 09:36, Dmitry A. Kazakov wrote:
> On 20/10/2016 02:31, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
>> news:nu4nee$18le$1@gioia.aioe.org...
>> ...
>>> Numeric character is a constraint expressible in Ada:
>>>
>>>    subtype Numeric is Character range '0'..'9';
>>>
>>> Numeric string constraint is not expressible, but it is still a constraint.
>>
>> It's expressible as a predicate, though; that's the entire point of
>> predicates (to act like user-defined constraints):
>>
>>     subtype Numeric_String is String
>>         with Dynamic_Predicate => (for all E of Numeric_String => E in
>> Numeric);
>>
>> It's not 100% as good as a constraint (as modifications of individual
>> components won't be checked), but it almost always will do the job.
>
> Not nice. Is there a reason why, apart from premature optimization?

I think you can add an aspect to the component type
and have that checked on assignment to a component.
The aspect could somehow be different from the
constraint; also, just repeating it appears to loop
infinitely with current GNATs.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78066


Anyway, a little inconvenience for starters:

     subtype My_Utf_8_String is String
       --  or, when not String, some array of any component type
       --  suitable as a byte sequence item type
       with Dynamic_Predicate => Is_Well_Formed (My_Utf_8_String);

     Bom: constant String := String'(Character'Val (16#EF#),
                                     Character'Val (16#BB#),
                                     Character'Val (16#BF#));

     function Has_Bom (U8: String) return Boolean is
       (U8'Length >= 3
          and then U8 (U8'First .. U8'First + 2) = Bom);

     function "abs" is new Ada.Unchecked_Conversion
       (Character, Interfaces.Unsigned_8);

     function Is_Well_Formed (U8 : String) return Boolean is
     --  `U8` has permissible bit patterns for all bytes. (No Table 3.7
     --  support.)
       ((if U8'Length > 0 then
           (if Has_Bom (U8)
            then
              Is_Well_Formed (U8 (U8'First + 3 .. U8'Last))
            else
              (for all J in U8'Range =>
                  (case abs U8 (J) is
                      when 2#0_0000000# .. 2#0_1111111# =>
                          --  ASCII compatibility
                          True,
                      when 2#10_000000# .. 2#10_111111# =>
                          --  is a following byte
                         (if J > U8'First then
                            (abs U8 (J - 1)
                               in 2#110_00000# .. 2#110_11111#
                               or abs U8 (J - 1)
                               in 2#1110_0000# .. 2#1110_1111#
                               or abs U8 (J - 1)
                               in 2#11110_000# .. 2#11110_111#)
                          else
                            False
                         ),
                      when 2#110_00000# .. 2#110_11111# =>
                         (if J < U8'Last then
                            (abs U8 (J + 1)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False),
                      when 2#1110_0000# .. 2#1110_1111# =>
                         (if J + 1 < U8'Last then
                            (for all K in J + 1 .. J + 2 =>
                               abs U8 (K)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False
                         ),
                      when 2#11110_000# .. 2#11110_111# =>
                         (if J + 2 < U8'Last then
                            (for all K in J + 1 .. J + 3 =>
                               abs U8 (K)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False
                         ),
                      when 2#11111_000# .. 2#11111_111# =>
                          --  not in Table 3.6 (UTF-8 Bit Distribution)
                          False
                  )
              )
           )
           --  String of length 0:
         else True));

     Test_Bom : constant My_Utf_8_String := Bom & "ABC";
     Test_US : constant My_Utf_8_String := "ABC";
     Test_GR : constant My_Utf_8_String := "ΑΒΓ";
     Test_RU : constant My_Utf_8_String := "АБГ";
     Test_Xx : constant My_Utf_8_String :=
       ('A', Character'Val (16#E4#), 'E');
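
As far as I can tell, the expression above accepts a following byte only when the byte directly before it is a lead byte, so a three- or four-byte sequence other than a leading BOM would be rejected. A loop-based check that counts following bytes is one way around that. The sketch below is an assumption-laden alternative, not the code from this post: the names UTF8_Check_Demo, Octet, Continuation, Euro and friends are made up, lead bytes are limited to the ranges Unicode allows (C2..DF, E0..EF, F0..F4), and like the original it does not implement the second-byte restrictions of Table 3-7.

```ada
pragma Assertion_Policy (Check);

procedure UTF8_Check_Demo is

   function Octet (Code : Natural) return Character is
     (Character'Val (Code));

   function Is_Well_Formed (U8 : String) return Boolean is
      subtype Continuation is Character
        range Character'Val (16#80#) .. Character'Val (16#BF#);
      J         : Integer := U8'First;
      Following : Natural;   --  number of following bytes expected
   begin
      while J <= U8'Last loop
         case U8 (J) is
            when Character'Val (16#00#) .. Character'Val (16#7F#) =>
               Following := 0;
            when Character'Val (16#C2#) .. Character'Val (16#DF#) =>
               Following := 1;
            when Character'Val (16#E0#) .. Character'Val (16#EF#) =>
               Following := 2;
            when Character'Val (16#F0#) .. Character'Val (16#F4#) =>
               Following := 3;
            when others =>
               --  Stray following byte or a byte never used as a lead.
               return False;
         end case;

         --  The sequence must not run past the end of the string.
         if U8'Last - J < Following then
            return False;
         end if;

         for K in J + 1 .. J + Following loop
            if U8 (K) not in Continuation then
               return False;
            end if;
         end loop;

         J := J + Following + 1;
      end loop;
      return True;
   end Is_Well_Formed;

   Alpha : constant String := (Octet (16#CE#), Octet (16#91#));
   Euro  : constant String :=
     (Octet (16#E2#), Octet (16#82#), Octet (16#AC#));
   Bad   : constant String := ('A', Octet (16#E4#), 'E');
   Stray : constant String := (1 => Octet (16#80#));
begin
   pragma Assert (Is_Well_Formed (""));
   pragma Assert (Is_Well_Formed ("ABC"));
   pragma Assert (Is_Well_Formed (Alpha));   --  two-byte sequence
   pragma Assert (Is_Well_Formed (Euro));    --  three-byte sequence
   pragma Assert (not Is_Well_Formed (Bad));
   pragma Assert (not Is_Well_Formed (Stray));
end UTF8_Check_Demo;
```

A function with a loop cannot be written as an expression function, but it can still be named in a Dynamic_Predicate in the same way as the one above.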


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-21 12:28             ` G.B.
@ 2016-10-21 16:13               ` Lucretia
  2016-10-21 16:43                 ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: Lucretia @ 2016-10-21 16:13 UTC (permalink / raw)


On Friday, 21 October 2016 13:28:52 UTC+1, G.B.  wrote:

>      Test_Bom : constant My_Utf_8_String := Bom & "ABC";
>      Test_US : constant My_Utf_8_String := "ABC";
>      Test_GR : constant My_Utf_8_String := "ΑΒΓ";
>      Test_RU : constant My_Utf_8_String := "АБГ";
>      Test_Xx : constant My_Utf_8_String :=
>        ('A', Character'Val (16#E4#), 'E');

Also, the most inefficient string ever:

Appended : My_UTF_8_String := "App";

Appended := Some_Other_String & 'e';  --  Calls Is_Well_Formed for each assignment! Sloooooooooooooow
Appended := Some_Other_String & 'n';
Appended := Some_Other_String & 'd';


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-21 16:13               ` Lucretia
@ 2016-10-21 16:43                 ` Dmitry A. Kazakov
  2016-10-22  5:51                   ` G.B.
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-21 16:43 UTC (permalink / raw)


On 2016-10-21 18:13, Lucretia wrote:
> On Friday, 21 October 2016 13:28:52 UTC+1, G.B.  wrote:
>
>>      Test_Bom : constant My_Utf_8_String := Bom & "ABC";
>>      Test_US : constant My_Utf_8_String := "ABC";
>>      Test_GR : constant My_Utf_8_String := "ΑΒΓ";
>>      Test_RU : constant My_Utf_8_String := "АБГ";
>>      Test_Xx : constant My_Utf_8_String :=
>>        ('A', Character'Val (16#E4#), 'E');
>
> Also, the most inefficient string ever:
>
> Appended : My_UTF_8_String := "App";
>
> Appended := Some_Other_String & 'e';  --  Calls Is_Well_Formed for each assignment! Sloooooooooooooow
> Appended := Some_Other_String & 'n';
> Appended := Some_Other_String & 'd';

For a UTF-8 string proper no checks would ever be required when a
character is appended.

The above is a sorry mess of representation colliding with the
semantics, octets with characters. 'e' is a Latin-1 character appended
as an octet while a Unicode character is meant.

Wrong design always gets punished one way or another.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-20  7:36           ` Dmitry A. Kazakov
  2016-10-21 12:28             ` G.B.
@ 2016-10-22  1:53             ` Randy Brukardt
  1 sibling, 0 replies; 30+ messages in thread
From: Randy Brukardt @ 2016-10-22  1:53 UTC (permalink / raw)

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:nu9s5v$18f0$1@gioia.aioe.org...
> On 20/10/2016 02:31, Randy Brukardt wrote:
...
>> It's not 100% as good as a constraint (as modifications of individual
>> components won't be checked), but it almost always will do the job.
>
> Not nice. Is there a reason why, apart from premature optimization?

Sure, checking after every component change would require passing some sort 
of checker subprogram with every reference parameter (since the actual could 
be part of some object that needs predicate checking). That sort of overhead 
would be completely unacceptable, especially as it would rarely be used.

As such, we stuck with the model that the checks are made in the same places 
that whole object constraint checks are made (such as discriminant checks). 
For private types, there is no difference, but some failures might be 
detected late for types with visible components (like String).
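
A small sketch of what "detected late" means in practice. This is an illustration under assumptions: the procedure name Predicate_Timing_Demo and the variables are made up, and Assertion_Policy is set to Check so that predicate checks are enabled.

```ada
pragma Assertion_Policy (Check);

with Ada.Assertions;

procedure Predicate_Timing_Demo is
   subtype Numeric is Character range '0' .. '9';
   subtype Numeric_String is String
     with Dynamic_Predicate =>
       (for all E of Numeric_String => E in Numeric);

   S      : Numeric_String := "123";
   Raised : Boolean := False;
begin
   --  Component update: no predicate check is made here, so S now
   --  holds a value that violates its own predicate.
   S (2) := 'x';
   pragma Assert (S = "1x3");

   --  Whole-object assignment: the predicate is checked and fails.
   begin
      S := "4x6";
   exception
      when Ada.Assertions.Assertion_Error =>
         Raised := True;
   end;
   pragma Assert (Raised);
end Predicate_Timing_Demo;
```

The bad component would only be noticed at the next point where the whole object is checked, for example when S is passed as a parameter of subtype Numeric_String.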

                             Randy.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-21 16:43                 ` Dmitry A. Kazakov
@ 2016-10-22  5:51                   ` G.B.
  2016-10-22  7:49                     ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: G.B. @ 2016-10-22  5:51 UTC (permalink / raw)


On 21.10.16 18:43, Dmitry A. Kazakov wrote:
> For a UTF-8 string proper no checks would ever be required when a character is appended.

No Unicode sequence in UTF should ever exist visibly in a
program other than either during parsing, or during output.

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-22  5:51                   ` G.B.
@ 2016-10-22  7:49                     ` Dmitry A. Kazakov
  2016-10-24 11:35                       ` Luke A. Guest
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-22  7:49 UTC (permalink / raw)


On 2016-10-22 07:51, G.B. wrote:
> On 21.10.16 18:43, Dmitry A. Kazakov wrote:
>> For a UTF-8 string proper no checks would ever be required when a
>> character is appended.
>
> No Unicode sequence in UTF should ever exist visibly in a
> program other than either during parsing, or during output.

Right.

Any encoded string must implement two distinct interfaces: an array of
characters and a sequence of encoding elements (e.g. octets). The two
happen to coincide for Latin-1 and UCS-2 strings, but for the majority
of encoding methods they are drastically different.
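
A minimal sketch of the two views, using the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings. The names Two_Views_Demo, Octet, Octets and Chars are made up for illustration; the octets are the UTF-8 encoding of the Greek letters U+0391, U+0392 and U+0393.

```ada
pragma Assertion_Policy (Check);

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Two_Views_Demo is
   package WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;

   function Octet (Code : Natural) return Character is
     (Character'Val (Code));

   --  View 1: a sequence of encoding elements (octets).
   Octets : constant UTF_8_String :=
     (Octet (16#CE#), Octet (16#91#),
      Octet (16#CE#), Octet (16#92#),
      Octet (16#CE#), Octet (16#93#));

   --  View 2: an array of characters (code points).
   Chars : constant Wide_Wide_String := WW.Decode (Octets);
begin
   pragma Assert (Octets'Length = 6);
   pragma Assert (Chars'Length = 3);
   pragma Assert
     (Wide_Wide_Character'Pos (Chars (Chars'First)) = 16#391#);

   --  Encoding the character view gives back the octet view.
   pragma Assert (WW.Encode (Chars) = Octets);
end Two_Views_Demo;
```

Indexing Octets yields encoding elements and indexing Chars yields characters; for UTF-8 the index ranges differ as soon as one character needs more than one octet.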

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-22  7:49                     ` Dmitry A. Kazakov
@ 2016-10-24 11:35                       ` Luke A. Guest
  2016-10-24 13:01                         ` Dmitry A. Kazakov
  0 siblings, 1 reply; 30+ messages in thread
From: Luke A. Guest @ 2016-10-24 11:35 UTC (permalink / raw)


Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
> On 2016-10-22 07:51, G.B. wrote:
>> On 21.10.16 18:43, Dmitry A. Kazakov wrote:
>>> For a UTF-8 string proper no checks would ever be required when a
>>> character is appended.
>> 
>> No Unicode sequence in UTF should ever exist visibly in a
>> program other than either during parsing, or during output.
> 
> Right.
> 
> Any encoded string must implement two distinct interfaces: an array of 
> characters and a sequence of encoding elements (e.g. octets). They 
> somehow fit to each other for Latin-1 and UCS-2 strings, but for 
> majority of encoding methods they are drastically different.
> 

There's no such thing as a character; there are octets (for UTF-8) and
code points.

You should also implement grapheme cluster access.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-24 11:35                       ` Luke A. Guest
@ 2016-10-24 13:01                         ` Dmitry A. Kazakov
  2016-10-24 14:54                           ` Luke A. Guest
  0 siblings, 1 reply; 30+ messages in thread
From: Dmitry A. Kazakov @ 2016-10-24 13:01 UTC (permalink / raw)


On 24/10/2016 13:35, Luke A. Guest wrote:
> Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:
>> On 2016-10-22 07:51, G.B. wrote:
>>> On 21.10.16 18:43, Dmitry A. Kazakov wrote:
>>>> For a UTF-8 string proper no checks would ever be required when a
>>>> character is appended.
>>>
>>> No Unicode sequence in UTF should ever exist visibly in a
>>> program other than either during parsing, or during output.
>>
>> Right.
>>
>> Any encoded string must implement two distinct interfaces: an array of
>> characters and a sequence of encoding elements (e.g. octets). They
>> somehow fit to each other for Latin-1 and UCS-2 strings, but for
>> majority of encoding methods they are drastically different.
>
> There's no such thing as a character, there are octets for Utf-8 and code
> points.

Code points or Wide_Wide_Character, it does not really matter for 
practical applications.

> You should also implement grapheme cluster access.

Not really needed.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-24 13:01                         ` Dmitry A. Kazakov
@ 2016-10-24 14:54                           ` Luke A. Guest
  0 siblings, 0 replies; 30+ messages in thread
From: Luke A. Guest @ 2016-10-24 14:54 UTC (permalink / raw)


Dmitry A. Kazakov <mailbox@dmitry-kazakov.de> wrote:

> Code points or Wide_Wide_Character, it does not really matter for 
> practical applications.
> 
>> You should also implement grapheme cluster access.
> 
> Not really needed.
> 

It is if you're doing rendering.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Bug in Ada - Latin 1 is not a subset of UTF-8
  2016-10-18  8:45       ` Dmitry A. Kazakov
  2016-10-18 10:09         ` G.B.
  2016-10-20  0:31         ` Randy Brukardt
@ 2016-10-28 21:08         ` Shark8
  2 siblings, 0 replies; 30+ messages in thread
From: Shark8 @ 2016-10-28 21:08 UTC (permalink / raw)


On Tuesday, October 18, 2016 at 2:45:05 AM UTC-6, Dmitry A. Kazakov wrote:
> On 18/10/2016 10:23, G.B. wrote:
> > On 18.10.16 09:41, Dmitry A. Kazakov wrote:
> >> On 18/10/2016 01:25, G.B. wrote:
> >>> On 17.10.16 22:18, Lucretia wrote:
> >>
> >>> According to ISO 10646, UTF stands for UCS Transformation
> >>> Format. So, it's a format, suggesting a representation.
> >>>
> >>> On similar grounds, one could define a string subtype for
> >>> other types of objects, for example
> >>>
> >>>   subtype Number_String is String;
> >>
> >> You are wrong.
> >
> > The constraints on either UTF_String or or Number_String are
> > not expressible as simple Ada subtypes. They are given by
> > description and normative reference, respectively.
> 
> In the case of UTF-8 it is not a constraint. "Ä" has different 
> representations as Latin-1 and UTF-8 strings.
> 
> Numeric character is a constraint expressible in Ada:
> 
>     subtype Numeric is Character range '0'..'9';
> 
> Numeric string constraint is not expressible, but it is still a constraint.


You are wrong; it is expressible:
   -- Using your Numeric character subtype.
   -- (Digits is a reserved word in Ada, hence the name Digit_String.)
   subtype Digit_String is String
    with Dynamic_Predicate => (for all Ch of Digit_String => Ch in Numeric);

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-10-28 21:08 UTC | newest]

Thread overview: 30+ messages
2016-10-17 20:18 Bug in Ada - Latin 1 is not a subset of UTF-8 Lucretia
2016-10-17 20:57 ` Jacob Sparre Andersen
2016-10-18  5:44   ` J-P. Rosen
2016-10-17 23:25 ` G.B.
2016-10-18  7:41   ` Dmitry A. Kazakov
2016-10-18  8:23     ` G.B.
2016-10-18  8:45       ` Dmitry A. Kazakov
2016-10-18 10:09         ` G.B.
2016-10-18 12:24           ` Dmitry A. Kazakov
2016-10-18 15:10             ` G.B.
2016-10-18 16:35               ` Dmitry A. Kazakov
2016-10-18 17:35                 ` G.B.
2016-10-18 20:03                   ` Dmitry A. Kazakov
2016-10-19  8:15                     ` G.B.
2016-10-19  8:25                       ` G.B.
2016-10-19  8:49                       ` Dmitry A. Kazakov
2016-10-19 14:20                         ` G.B.
2016-10-19 16:20                           ` Dmitry A. Kazakov
2016-10-20  0:31         ` Randy Brukardt
2016-10-20  7:36           ` Dmitry A. Kazakov
2016-10-21 12:28             ` G.B.
2016-10-21 16:13               ` Lucretia
2016-10-21 16:43                 ` Dmitry A. Kazakov
2016-10-22  5:51                   ` G.B.
2016-10-22  7:49                     ` Dmitry A. Kazakov
2016-10-24 11:35                       ` Luke A. Guest
2016-10-24 13:01                         ` Dmitry A. Kazakov
2016-10-24 14:54                           ` Luke A. Guest
2016-10-22  1:53             ` Randy Brukardt
2016-10-28 21:08         ` Shark8
