UTF-8 Output and "-gnatW8"

comp.lang.ada

* UTF-8 Output and "-gnatW8"
@ 2016-03-24 17:23 Michael Rohan
  2016-03-24 22:09 ` Randy Brukardt
  2016-03-25  5:54 ` rieachus
  0 siblings, 2 replies; 12+ messages in thread
From: Michael Rohan @ 2016-03-24 17:23 UTC (permalink / raw)


Hi Folks,

I'm seeing what I suspect is a GNAT run-time re-encoding of an already encoded UTF-8 string when the "-gnatW8" option is used.  The help info on "-gnatW8" states

-gnatW?   Wide character encoding method (?=h/u/s/e/8/b)

I've been using this option to state that my source files are UTF-8 encoded, but I don't particularly want to change the behaviour of the Ada.Text_IO routines.  I don't see an option that covers just the source file encoding without impacting the Text_IO (narrow) functionality.

I'm going to adjust my build process to only use "-gnatW8" when compiling sources that contain non-ASCII, UTF-8 characters.

It's pretty easy to see this.  Here's an already UTF-8 encoded string example:

with Ada.Text_IO;
procedure PiDay is
begin
   Ada.Text_IO.Put_Line (
      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
end PiDay;

Building and executing with and without "-gnatW8" gives

$ gnatmake piday
gcc -c piday.adb
gnatbind -x piday.ali
gnatlink piday.ali
$ ./piday 
It's π day.
$ touch piday.adb 
$ gnatmake -gnatW8 piday
gcc -c -gnatW8 piday.adb
gnatbind -x piday.ali
gnatlink piday.ali
$ ./piday 
It's Ï€ day.
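
The second output is what you would get if the two bytes were treated as Latin-1 code points and encoded to UTF-8 a second time. A minimal sketch of that effect, built without "-gnatW8" and using the standard Ada 2012 package Ada.Strings.UTF_Encoding.Strings (the procedure name Double_Encode is just for illustration, and this is not a claim about what the GNAT run-time does internally):

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Double_Encode is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  The two bytes of the UTF-8 encoding of U+03C0, held as Characters.
   Already_Encoded : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);

   --  Encoding again treats each byte as a Latin-1 code point
   --  (U+00CF and U+0080) and emits two bytes for each of them.
   Twice : constant UTF.UTF_8_String :=
     UTF.Strings.Encode (Already_Encoded);

   Hex : constant String := "0123456789ABCDEF";
begin
   --  Print each resulting byte as two hex digits.
   for C of Twice loop
      Ada.Text_IO.Put
        (Hex (Character'Pos (C) / 16 + 1)
         & Hex (Character'Pos (C) mod 16 + 1)
         & ' ');
   end loop;
   Ada.Text_IO.New_Line;
end Double_Encode;
```

This prints "C3 8F C2 80", i.e. four bytes where the original string had two.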

The RM includes an "Implementation Requirement":

16/3
 An Ada implementation shall accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where every character is represented by its code point. The character pair CARRIAGE RETURN/LINE FEED (code points 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other occurrence of a format_effector other than the character whose code point position is 16#09# (CHARACTER TABULATION) also signifies a single end of line.

It feels like we should be able to explicitly define the encoding for a source via pragma:

    pragma Character_Set ("UTF-8");

Take care,
Michael.



* Re: UTF-8 Output and "-gnatW8"
  2016-03-24 17:23 UTF-8 Output and "-gnatW8" Michael Rohan
@ 2016-03-24 22:09 ` Randy Brukardt
  2016-03-24 22:34   ` Michael Rohan
  2016-03-25  5:54 ` rieachus
  1 sibling, 1 reply; 12+ messages in thread
From: Randy Brukardt @ 2016-03-24 22:09 UTC (permalink / raw)


"Michael Rohan" <michael@zanyblue.com> wrote in message 
news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
...
>I've been using this option to state that my source files are UTF-8 encoded 
>but
>I don't particularly want to change the behaviour of the Ada.Text_IO 
>routines.

I don't see any reason that the character encoding option ought to change 
the runtime behavior of anything - it ought to just tell the compiler about 
the form of the source code. But I'm definitely not an expert in GNAT.

>  I don't see an option that covers just the source file encoding without 
> impacting the Text_IO (narrow) functionality.

I don't see anything in the documentation you posted that it has any effect 
on Text_IO, nor would I expect it to, since it says it controls the 
representation of Wide_Characters, and there are no wide characters 
associated with Text_IO.

>It's pretty easy to see this.  Here's an already UTF-8 encoded string 
>example:
>
>with Ada.Text_IO;
>procedure PiDay is
>begin
>   Ada.Text_IO.Put_Line (
>      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
>end PiDay;

Since this program text doesn't include any wide characters, there should be 
no effect on the behavior of Text_IO.

I think what you are seeing is just a bug; I'd suggest reporting it as a bug to 
AdaCore and see what they say. (Even if they intended something to happen 
here, it seems to be a horribly bad idea.) My guess is that they are folding 
the string literal and then encoding that into UTF-8, even though such 
encoding is too late.

>The RM includes an "Implementation Requirement":
>
>16/3
> An Ada implementation shall accept Ada source code in UTF-8 encoding, with 
> or
> without a BOM (see A.4.11), where every character is represented by its 
> code
> point. The character pair CARRIAGE RETURN/LINE FEED (code points
>16#0D# 16#0A#) signifies a single end of line (see 2.2); every other 
>occurrence
> of a format_effector other than the character whose code point position is 
> 16#09#
> (CHARACTER TABULATION) also signifies a single end of line.

Two points here:

(1) The Ada Standard requires no other encoding. The expectation is that in 
the long term, all Ada (portable) source code will be encoded in UTF-8. 
There's no requirement for a compiler to support anything else, and the only 
need beyond that is to process legacy code -- a tool similar to GNATChop 
could handle that without messing up the compiler. (Note that the ACATS is 
provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a 
subset of the latter.)

(2) This is *only* about the source encoding. It has no effect on anything 
beyond the lexical level of an Ada program. In particular, it has no effect 
on any runtime behavior. Indeed, source encoding is so different than 
anything specified in the Ada Standard that in previous versions of Ada, it 
wasn't specified at all. Source encoding, other than the UTF-8 encoding 
defined in the Standard, is inherently implementation-defined, because the 
interpretation of the encoding has to happen before any Ada rules can be 
applied (from lexical and syntax rules on down).

>It feels like we should be able to explicitly define the encoding for a 
>source via pragma:
>
>    pragma Character_Set ("UTF-8");

This is clearly pointless:
(1) As noted above, the only required source encoding is UTF-8. If you need 
portable code, there is no other choice, and if you don't, you don't need a 
portable way to specify it.
(2) It should be obvious that a pragma is too late. Since such a pragma is 
inside of the source code, and encoded using whatever encoding, by the time 
the compiler recognizes it, it has already been assuming an encoding. And if 
it assumed wrong, it probably couldn't recognize it at all (consider source 
code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the 
compiler already knows. And since it has to be optional (obviously, no 
existing Ada source code has such a pragma), the absence of it doesn't tell 
the compiler anything, either.

So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B) 
complain to your vendor if the encoding does anything other than determine 
the source code encoding.

                                    Randy.




* Re: UTF-8 Output and "-gnatW8"
  2016-03-24 22:09 ` Randy Brukardt
@ 2016-03-24 22:34   ` Michael Rohan
  2016-03-25 19:15     ` Randy Brukardt
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Rohan @ 2016-03-24 22:34 UTC (permalink / raw)


Hi,

OK, so this might be a compiler bug.  The RM states the character set should
be ISO 10646 so EBCDIC would seem to be something that is not allowed.

The implementation for GNAT impacts the handling of strings, e.g.,

S : constant Wide_String := "π";

With "-gnatW8" this is correctly interpreted as a string of length 1
containing the character U+03C0.  Without the "-gnatW8" option, GNAT
interprets it as a string of Characters to convert to a Wide_String,
i.e., the two characters U+00CF and U+0080.

Is the constant string value ambiguous here?

Take care,
Michael.



* Re: UTF-8 Output and "-gnatW8"
  2016-03-24 17:23 UTF-8 Output and "-gnatW8" Michael Rohan
  2016-03-24 22:09 ` Randy Brukardt
@ 2016-03-25  5:54 ` rieachus
  2016-03-25 19:18   ` Randy Brukardt
  1 sibling, 1 reply; 12+ messages in thread
From: rieachus @ 2016-03-25  5:54 UTC (permalink / raw)

On Thursday, March 24, 2016, Michael Rohan wrote:
> The implementation for GNAT impacts the handling of strings, e.g., 
>
> S : constant Wide_String := "π"; 

> With "-gnatW8" this is correctly interpreted as a string of length 1 
> containing the character U+03C0.  Without the "-gnatW8" option, GNAT 
> interprets it as a string of Characters to convert to a Wide_String, 
> i.e., the two character U+00CF and U+0080 

No, it is complex to explain and understand, but once you "get it" you shouldn't have any further problems.

There are (at least) three character encodings in any Ada program, usually more.  It is nice if they can all match, but that can be problematic.  The first is the encoding used in source files.  This obviously can't be chosen by a pragma, and GNAT uses -gnatW8 to force UTF-8.  Since the Ada standard doesn't say anything about the operating system instructions used to call the compiler, this is fine.  Note that the printer character set used for program listings can be different, same with debugger settings, and the character set used by your terminal.

Second are the character set(s) when an Ada program reads from or writes to files or other devices.  The Ada standard defines that binding for Text_IO, Wide_Text_IO, etc.  Notice that this is about character sets, not character encodings.  You could have a Form string "UTF8" for files, with a Wide_Text_IO version that understood it.  Same for "Unicode" and so on.  The program would only see ISO-10646 characters, but the generated files would be much smaller. ;-)

Finally there is the representation of (ISO-10646) characters in source files.  Yes, if the source file uses UTF8 you are fine.  Well not really.  The terminal that you use to create the file may not have the characters you need to use, or in the case of control characters, they may have a different meaning to the compiler.  The standard includes rules for such encodings, and there are (standard) packages which need to use them.

Anyway, you get the standard defined behavior when using -gnatW8, so there is no bug.  The "Ï€" or whatever shows up is probably governed by your terminal's character set. I don't see a listing with Wide_Character(16#CF80#) but you would need an output program that supports it.  (The nice thing about electronic displays is they can support dozens of different character encodings.)


* Re: UTF-8 Output and "-gnatW8"
  2016-03-24 22:34   ` Michael Rohan
@ 2016-03-25 19:15     ` Randy Brukardt
  0 siblings, 0 replies; 12+ messages in thread
From: Randy Brukardt @ 2016-03-25 19:15 UTC (permalink / raw)


"Michael Rohan" <michael@zanyblue.com> wrote in message 
news:4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com...
>Hi,
>
>OK, so this might be a compiler bug.  The RM states the character set 
>should
>be ISO 10646 so EBCDIC would seem to be something that is not allowed.

Ah, that's a common mistake. The RM specifies what the *runtime* character 
set is. Prior to Ada 2012, it said *nothing* about the encoding of Ada 
source code, and even now, it only talks about UTF-8 as one possibility for 
that encoding. Anything else is allowed, including EBCDIC, Shift-JIS, or 
even some sort of tree (the latter is explicitly mentioned as a possibility 
in the AARM). Someone even suggested a source representation where '{' = 
"begin", '}' = "end", etc. (It was that suggestion that finally got the UTF-8 
"standard" encoding into the Standard, to provide real interoperability for 
Ada source code.)

>The implementation for GNAT impacts the handling of strings, e.g.,
>
>S : constant Wide_String := "π";
>
>With "-gnatW8" this is correctly interpreted as a string of length 1
>containing the character U+03C0.  Without the "-gnatW8" option, GNAT
>interprets it as a string of Characters to convert to a Wide_String,
>i.e., the two character U+00CF and U+0080

That seems right to me. (In a new compiler, I'd make UTF-8 the default, but 
any existing compiler probably would have to make it a switch of some sort.) 
But that's because you have a UTF-8 character in the source code.

The bug is that you said that some source code with no explicit UTF-8 
characters (rather representing them as Character'Val(16#C0#) and the like) 
was changing behavior in response to such a switch. That's a bug in my 
view (Character'Val(16#C0#) isn't a character literal at compile-time, it's a 
function call, and its representation is the same regardless of whether the 
source is read as 7-bit ASCII or UTF-8).

>Is the constant string value ambiguous here?

It means something different depending upon the source representation. I 
believe GNAT is getting that correct.

"Character'Val(16#C0#)" means the same thing in either source 
representation, so you should get the same results for the program 
containing that. If you don't, that's a bug.

Hope this clears it up.

                                    Randy.




* Re: UTF-8 Output and "-gnatW8"
  2016-03-25  5:54 ` rieachus
@ 2016-03-25 19:18   ` Randy Brukardt
  2016-03-28 22:48     ` Michael Rohan
  0 siblings, 1 reply; 12+ messages in thread
From: Randy Brukardt @ 2016-03-25 19:18 UTC (permalink / raw)


<rieachus@comcast.net> wrote in message 
news:3a65e71c-41ee-49eb-916d-c0be8be9abc6@googlegroups.com...
...
>Second are the character set(s) when an Ada program reads from or writes to 
>files
>or other devices.  The Ada standard defines that binding for Text_IO, 
>Wide_Text_IO,
>etc.  Notice that this is about character sets, not character encodings. 
>You could have
>a Form string "UTF8" for files, with a Wide_Text_IO version that understood 
>it.  Same
>for "Unicode" and so on.  The program would only see ISO-10646 characters, 
>but the
>generated files would be much smaller. ;-)

Or you could use Ada.Strings.UTF_Encoding to convert the string to UTF-8 and 
then use Ada.Text_IO to output it. (This is an end-run around strong typing, 
sadly, but it works.)
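
A minimal sketch of that approach, assuming the Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Strings and a build without "-gnatW8" (so that, going by the behaviour reported at the start of this thread, Text_IO passes the bytes through unchanged). The procedure and constant names are only illustrative:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Pi_Day_Encoded is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  U+03C0 is given by code point, so the source file stays 7-bit ASCII.
   Pi      : constant Wide_String := (1 => Wide_Character'Val (16#03C0#));
   Message : constant Wide_String := "It's " & Pi & " day.";

   --  UTF_8_String is a subtype of String: this is the end-run around
   --  strong typing, since the type no longer says "these are UTF-8 bytes".
   Encoded : constant UTF.UTF_8_String := UTF.Wide_Strings.Encode (Message);
begin
   Ada.Text_IO.Put_Line (Encoded);
end Pi_Day_Encoded;
```

Whether the terminal then shows the expected character still depends on the terminal expecting UTF-8.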

                                   Randy.

P.S. Robert, nice to hear from you again. It's been a while, hope you're 
doing well.



* Re: UTF-8 Output and "-gnatW8"
  2016-03-25 19:18   ` Randy Brukardt
@ 2016-03-28 22:48     ` Michael Rohan
  2016-03-29  7:44       ` Dmitry A. Kazakov
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Michael Rohan @ 2016-03-28 22:48 UTC (permalink / raw)

On Friday, March 25, 2016 at 12:19:01 PM UTC-7, Randy Brukardt wrote:
> <rieachus@comcast.net> wrote in message 
> news:3a65e71c-41ee-49eb-916d-c0be8be9abc6@googlegroups.com...
> ...
> >Second are the character set(s) when an Ada program reads from or writes to 
> >files
> >or other devices.  The Ada standard defines that binding for Text_IO, 
> >Wide_Text_IO,
> >etc.  Notice that this is about character sets, not character encodings. 
> >You could have
> >a Form string "UTF8" for files, with a Wide_Text_IO version that understood 
> >it.  Same
> >for "Unicode" and so on.  The program would only see ISO-10646 characters, 
> >but the
> >generated files would be much smaller. ;-)
> 
> Or you could use Ada.Strings.UTF_Encoding to convert the string to UTF-8 and 
> then use Ada.Text_IO to output it. (This is an end-run around strong typing, 
> sadly, but it works.)
> 
>                                    Randy.
> 
> P.S. Robert, nice to hear from you again. It's been a while, hope you're 
> doing well.

Hi,

My approach is to encode myself and write the encoded Character via Text_IO, reserving -gnatW8 for just those files containing UTF-8 data.

It does, however, feel like there is something missing where it's "difficult" to have a Wide_String literal without having to have extra meta data for compiler (-gnatW8) or having a relatively cumbersome concatenation of Wide_Character's based on code points.  BTW, the performance of GNAT for such a concatenated string is pretty dismal.

Not really advocating the C/C++ style \ escaping, e.g., \x, \u, \U, but it would be "nice" to express such constant strings easily.  It was mentioned that Wide_Character'Val requires elaboration.  Presumably, a compiler should be able to optimize it away but I'm not sure if it's allowed to do that?

Take care,
Michael.


* Re: UTF-8 Output and "-gnatW8"
  2016-03-28 22:48     ` Michael Rohan
@ 2016-03-29  7:44       ` Dmitry A. Kazakov
  2016-03-29  8:39       ` G.B.
  2016-03-29 22:35       ` Randy Brukardt
  2 siblings, 0 replies; 12+ messages in thread
From: Dmitry A. Kazakov @ 2016-03-29  7:44 UTC (permalink / raw)


On 29/03/2016 00:48, Michael Rohan wrote:

> My approach is to encode myself and write the encoded Character via
> Text_IO,

Which is a bad idea, you know. Text_IO is not guaranteed to support 
UTF-8 encoding. Actually rather the opposite since Character is declared 
Latin-1. You should use Stream_IO instead of Text_IO.
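
Something along these lines, as a rough sketch only: it writes the encoded bytes to a file through Ada.Streams.Stream_IO, so no text layer gets a chance to reinterpret them. The file name "piday.txt" and the procedure name are made up for the example:

```ada
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Pi_Day_Stream is
   package UTF renames Ada.Strings.UTF_Encoding;
   package SIO renames Ada.Streams.Stream_IO;

   --  U+03C0 given by code point, so the source stays 7-bit ASCII.
   Message : constant Wide_String :=
     "It's " & Wide_Character'Val (16#03C0#) & " day.";
   Encoded : constant UTF.UTF_8_String := UTF.Wide_Strings.Encode (Message);

   File : SIO.File_Type;
begin
   SIO.Create (File, SIO.Out_File, "piday.txt");
   --  String'Write emits the elements one by one, without array bounds.
   String'Write (SIO.Stream (File), Encoded & ASCII.LF);
   SIO.Close (File);
end Pi_Day_Stream;
```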

> It does, however, feel like there is something missing where it's
> "difficult" to have a Wide_String literal without having to have extra
> meta data for compiler (-gnatW8) or having a relatively cumbersome
> concatenation of Wide_Character's based on code points.

Why don't you use UTF-8 strings instead?

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: UTF-8 Output and "-gnatW8"
  2016-03-28 22:48     ` Michael Rohan
  2016-03-29  7:44       ` Dmitry A. Kazakov
@ 2016-03-29  8:39       ` G.B.
  2016-03-29 22:35       ` Randy Brukardt
  2 siblings, 0 replies; 12+ messages in thread
From: G.B. @ 2016-03-29  8:39 UTC (permalink / raw)


On 29.03.16 00:48, Michael Rohan wrote:
> It does, however, feel like there is something missing where it's "difficult" to have a Wide_String literal without having to have extra meta data for compiler (-gnatW8) or having a relatively cumbersome concatenation of Wide_Character's based on code points.  BTW, the performance of GNAT for such a concatenated string is pretty dismal.

That source character set issue is probably resolved in no time
once NATO, inasmuch as it is a software organization, will require
that compilers automatically detect UTF-16.



* Re: UTF-8 Output and "-gnatW8"
  2016-03-28 22:48     ` Michael Rohan
  2016-03-29  7:44       ` Dmitry A. Kazakov
  2016-03-29  8:39       ` G.B.
@ 2016-03-29 22:35       ` Randy Brukardt
  2016-04-04 10:52         ` G.B.
  2 siblings, 1 reply; 12+ messages in thread
From: Randy Brukardt @ 2016-03-29 22:35 UTC (permalink / raw)


"Michael Rohan" <michael@zanyblue.com> wrote in message 
news:6406289c-06a8-46d1-a633-8a1c8a22f79b@googlegroups.com...
...
>It does, however, feel like there is something missing where it's 
>"difficult" to have
>a Wide_String literal without having to have extra meta data for compiler 
>(-gnatW8)
>or having a relatively cumbersome concatenation of Wide_Character's based 
>on
>code points.  BTW, the performance of GNAT for such a concatenated string 
>is
>pretty dismal.

Both of these are clearly implementation issues as opposed to language 
issues. The language standard can have nothing to say about what incantation 
it takes to compile a program, and how one identifies the source format is 
just a small part of that. (As I previously said, a new compiler would 
probably make UTF-8 format the default, but changing the default on an 
existing compiler could cause trouble for many existing customers - I 
wouldn't expect such a change to be made lightly.) Usability of a compiler 
is completely out of bounds for a language standard (*any* language 
standard). (Janus/Ada doesn't support character values > 255 in any format; 
it still conforms to the older Ada Standards.)

And performance, of course, is clearly an implementation issue. (I also 
would be quite surprised if that was an issue for truly constant strings. I 
could see how it might be an issue if part of the string is calculated from 
some variable, but since Ada defines and requires static strings in some 
circumstances, an Ada compiler has the machinery to avoid any runtime costs 
for static string expressions.)

In any case, GNAT /= Ada.

                             Randy.




* Re: UTF-8 Output and "-gnatW8"
  2016-03-29 22:35       ` Randy Brukardt
@ 2016-04-04 10:52         ` G.B.
  2016-04-05  0:39           ` Randy Brukardt
  0 siblings, 1 reply; 12+ messages in thread
From: G.B. @ 2016-04-04 10:52 UTC (permalink / raw)

On 30.03.16 00:35, Randy Brukardt wrote:
> "Michael Rohan" <michael@zanyblue.com> wrote in message
> news:6406289c-06a8-46d1-a633-8a1c8a22f79b@googlegroups.com...
> ...
>> It does, however, feel like there is something missing where it's
>> "difficult" to have
>> a Wide_String literal without having to have extra meta data for compiler
>> (-gnatW8)
>> or having a relatively cumbersome concatenation of Wide_Character's based
>> on
>> code points.  BTW, the performance of GNAT for such a concatenated string
>> is
>> pretty dismal.
>
> Both of these are clearly implementation issues as opposed to language
> issues.

Ada users would expect to be able to express numeric literals,
I think, and without any implementation issues whatsoever.
This includes numeric capacity, which they expect
a compiler to report correctly, which implies no implementation
issues when parsing numeric literals.

However —I'm guessing— there is embarrassment lurking behind
handling non-ASCII strings:
it mostly hinges on the pampered, old misunderstanding that char has
eight bits, 7 of which are to be used, and each is fixed to represent
one ASCII character. Hence, trying to handle more than that in
any tool, including a compiler reading a source unit, is
deemed equivalent to tackling a hard problem of number theory.

No one would tolerate that kind of allegation of complexity
of handling contemporary character sets, historically grown as it
may be, for numeric literals of Ada. There is room for compromise
when ISO-ing source character sets, I would hope just like there is
room for compromise when a compiler is not required to solve problems
of number theory when lexing and parsing numeric literals.

C++ has a related problem with string literals. It costs customers'
time and money.


* Re: UTF-8 Output and "-gnatW8"
  2016-04-04 10:52         ` G.B.
@ 2016-04-05  0:39           ` Randy Brukardt
  0 siblings, 0 replies; 12+ messages in thread
From: Randy Brukardt @ 2016-04-05  0:39 UTC (permalink / raw)

"G.B." <bauhaus@futureapps.invalid> wrote in message 
news:ndtgqj$tah$1@dont-email.me...
> On 30.03.16 00:35, Randy Brukardt wrote:
...
> However -I'm guessing- there is embarrassment lurking behind
> handling non-ASCII strings:
> it mostly hinges on the pampered, old misunderstanding that char has
> eight bits, 7 of which are to be used, and each is fixed to represent
> one ASCII character.

Not really. The problem stems from Ada 95 using Latin-1 as the primary 
character set; most Ada 95 compilers accept Latin-1 source code where all 
8-bits are used. There is a lot of such source code in the wild.

UTF-8 represents Latin-1 characters over position 127 as two bytes (as opposed to 
one). There is no possible automatic way to tell between these 
representations, as any legal UTF-8 representation of Latin-1 characters 
also has a (different) meaning if read as Latin-1.
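
To make that concrete, here is a small sketch that reads the same two bytes both ways. The byte pair C3 A9 is my own choice of example (it is the UTF-8 form of U+00E9); the packages are the standard Ada.Characters.Conversions and Ada.Strings.UTF_Encoding.Wide_Strings:

```ada
with Ada.Text_IO;
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Two_Readings is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  One byte stream, with nothing in it that says how to read it.
   Bytes : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  Read as Latin-1: every byte is one character (U+00C3, U+00A9).
   As_Latin_1 : constant Wide_String :=
     Ada.Characters.Conversions.To_Wide_String (Bytes);

   --  Read as UTF-8: the two bytes form one character (U+00E9).
   As_UTF_8 : constant Wide_String := UTF.Wide_Strings.Decode (Bytes);
begin
   Ada.Text_IO.Put_Line
     ("Latin-1 reading:" & Integer'Image (As_Latin_1'Length) & " character(s)");
   Ada.Text_IO.Put_Line
     ("UTF-8 reading:  " & Integer'Image (As_UTF_8'Length) & " character(s)");
end Two_Readings;
```

Both readings are legal, which is why the bytes alone cannot settle the question.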

Thus, a compiler that needs to take both formats (like GNAT), needs to be 
told which format it is. Most compilers have a default format (probably 
Latin-1 from Ada 95), and changing that default would break a lot of 
customers' existing compilation scripts. So no vendor would do that; after 
all, it's easier to keep an existing customer than to get a new one. There's 
no benefit to pissing them off.

A brand-new built-from-scratch compiler would almost certainly default to 
UTF-8 (that being the Ada 2012 default format).

> Hence, trying to handle more than that in
> any tool, including a compiler reading a source unit, is
> deemed equivalent to tackling a hard problem of number theory.

When you have two identical byte streams that the user intends to mean 
different things, it is clearly impossible for any tool to differentiate 
them. Something external has to describe the format. (It would be nice if 
commonly used OSes included this information, but they don't.)

And what the heck do numeric literals have to do with this anyway? The OP 
was having problems with string literals. (Ada doesn't allow any non-ASCII 
characters in numeric literals anyway, it's the identifiers, string 
literals, and comments that cause the issue.)

                               Randy. 

