Re: UTF-8 Output and "-gnatW8"

comp.lang.ada

From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Fri, 25 Mar 2016 14:15:00 -0500
Message-ID: <nd42nl$fj5$1@loke.gir.dk>
In-Reply-To: 4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com

"Michael Rohan" <michael@zanyblue.com> wrote in message 
news:4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com...
>Hi,
>
>OK, so this might be a compiler bug.  The RM states the character set
>should be ISO 10646, so EBCDIC would seem to be something that is not
>allowed.

Ah, that's a common mistake. The RM specifies what the *runtime* character 
set is. Prior to Ada 2012, it said *nothing* about the encoding of Ada 
source code, and even now, it only talks about UTF-8 as one possibility for 
that encoding. Anything else is allowed, including EBCDIC, Shift-JIS, or 
even some sort of tree (the latter is explicitly mentioned as a possibility 
in the AARM). Someone even suggested a source representation where '{' = 
"begin", '}' = "end", etc. (It was that suggestion that finally got the UTF-8 
"standard" encoding into the Standard, to provide real interoperability for 
Ada source code.)

>The implementation for GNAT impacts the handling of strings, e.g.,
>
>S : constant Wide_String := "π";
>
>With "-gnatW8" this is correctly interpreted as a string of length 1
>containing the character U+03C0.  Without the "-gnatW8" option, GNAT
>interprets it as a string of Characters to convert to a Wide_String,
>i.e., the two characters U+00CF and U+0080.

That seems right to me. (In a new compiler, I'd make UTF-8 the default, but 
any existing compiler probably would have to make it a switch of some sort.) 
But that's because you have a UTF-8 character in the source code.
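
As a rough sketch of those two readings (not code from the original thread), 
the program below spells out each interpretation explicitly with 
Wide_Character'Val, so it contains only 7-bit ASCII and does not itself 
depend on the source encoding. The names Show_Lengths, As_UTF_8, and 
As_Latin_1 are made up for illustration, and the "Latin-1" reading assumes 
GNAT's default treatment of upper-half bytes when "-gnatW8" is absent.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Lengths is
   --  What the literal denotes when the source is read as UTF-8
   --  (for GNAT, with -gnatW8): one wide character, U+03C0.
   As_UTF_8 : constant Wide_String :=
     (1 => Wide_Character'Val (16#03C0#));

   --  What the same two source bytes (16#CF#, 16#80#) denote when each
   --  byte is read as one 8-bit character: U+00CF followed by U+0080.
   --  (Assumption: the default source encoding is Latin-1.)
   As_Latin_1 : constant Wide_String :=
     (1 => Wide_Character'Val (16#CF#),
      2 => Wide_Character'Val (16#80#));
begin
   --  Only ASCII text is written, so the output encoding does not matter.
   Put_Line ("UTF-8 reading:   length" & Integer'Image (As_UTF_8'Length));
   Put_Line ("Latin-1 reading: length" & Integer'Image (As_Latin_1'Length));
end Show_Lengths;
```

The first constant has length 1 and the second length 2, which matches the 
difference described for the literal with and without "-gnatW8".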

The bug is that you said that some source code with no explicit UTF-8 
characters (rather representing them as Character'Val(16#C0#) and the like) 
was changing behavior in response to such a switch. That's a bug in my 
view (Character'Val(16#C0#) isn't a character literal at compile time; it's a 
function call, and its representation is the same regardless of whether the 
source is read as 7-bit ASCII or UTF-8).

>Is the constant string value ambiguous here?

It means something different depending upon the source representation. I 
believe GNAT is getting that correct.

"Character'Val(16#C0#)" means the same thing in either source 
representation, so you should get the same results for the program 
containing that. If you don't, that's a bug.
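
A minimal test along these lines (a sketch, not the program from the 
original report; the names Show_Val, C, and W are invented here) could be 
compiled once with "-gnatW8" and once without, and the two outputs compared. 
The source is pure 7-bit ASCII, so nothing in it should depend on how the 
compiler decodes the file.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Val is
   --  No literal here contains a non-ASCII character; the value is
   --  produced by the attribute function Character'Val.
   C : constant Character := Character'Val (16#C0#);

   --  Widen the single character by position, giving a Wide_String
   --  that holds exactly one element.
   W : constant Wide_String :=
     (1 => Wide_Character'Val (Character'Pos (C)));
begin
   --  Only ASCII text is written, so the output encoding does not matter.
   Put_Line ("Position:" & Integer'Image (Character'Pos (C)));
   Put_Line ("Length:  " & Integer'Image (W'Length));
end Show_Val;
```

Both builds should report position 192 and length 1; a difference between 
them would be the kind of bug described above.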

Hope this clears it up.

                                    Randy.
