Re: UTF-8 Output and "-gnatW8"

comp.lang.ada

From: Michael Rohan <michael@zanyblue.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 15:34:33 -0700 (PDT)
Message-ID: <4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com>
In-Reply-To: <nd1oir$c8r$1@loke.gir.dk>

Hi,

OK, so this might be a compiler bug.  The RM states the character set should
be ISO 10646 so EBCDIC would seem to be something that is not allowed.

The implementation for GNAT impacts the handling of strings, e.g.,

S : constant Wide_String := "π";

With "-gnatW8" this is correctly interpreted as a string of length 1
containing the character U+03C0.  Without the "-gnatW8" option, GNAT
interprets it as a string of Characters to convert to a Wide_String,
i.e., the two characters U+00CF and U+0080.

Is the constant string value ambiguous here?
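
A minimal sketch for checking this on a given compiler, assuming the
file is saved as UTF-8 (the procedure name Pi_Length is made up for
illustration).  It prints only the length and the code point positions,
so the output itself is plain ASCII:

```ada
with Ada.Text_IO;

procedure Pi_Length is
   --  The literal below is stored in the source file as the two
   --  UTF-8 bytes 16#CF# 16#80#.
   S : constant Wide_String := "π";
begin
   Ada.Text_IO.Put_Line ("Length:" & Integer'Image (S'Length));
   for C of S loop
      --  Print each element's code point position in decimal.
      Ada.Text_IO.Put_Line
        ("Code point:" & Integer'Image (Wide_Character'Pos (C)));
   end loop;
end Pi_Length;
```

Going by the behaviour described above, building with "-gnatW8" should
report length 1 and code point 960 (16#03C0#); building without it
should report length 2 with code points 207 and 128 (16#CF#, 16#80#).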

Take care,
Michael.

On Thursday, March 24, 2016 at 3:09:33 PM UTC-7, Randy Brukardt wrote:
> "Michael Rohan" <michael@zanyblue.com> wrote in message 
> news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
> ...
> >I've been using this option to state that my source files are UTF-8 encoded
> >but I don't particularly want to change the behaviour of the Ada.Text_IO
> >routines.
> 
> I don't see any reason that the character encoding option ought to change 
> the runtime behavior of anything - it ought to just tell the compiler about 
> the form of the source code. But I'm definitely not an expert in GNAT.
> 
> >  I don't see an option that covers just the source file encoding without 
> > impacting the Text_IO (narrow) functionality.
> 
> I don't see anything in the documentation you posted that it has any effect 
> on Text_IO, nor would I expect it to, since it says it controls the 
> representation of Wide_Characters, and there are no wide characters 
> associated with Text_IO.
> 
> >It's pretty easy to see this.  Here's an already UTF-8 encoded string 
> >example:
> >
> >with Ada.Text_IO;
> >procedure PiDay is
> >begin
> >   Ada.Text_IO.Put_Line (
> >      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
> >end PiDay;
> 
> Since this program text doesn't include any wide characters, there should be 
> no effect on the behavior of Text_IO.
> 
> I think what you are seeing is just a bug; I'd suggest reporting it as a bug to 
> AdaCore and see what they say. (Even if they intended something to happen 
> here, it seems to be a horribly bad idea.) My guess is that they are folding 
> the string literal and then encoding that into UTF-8, even though such 
> encoding is too late.
> 
> >The RM includes an "Implementation Requirement":
> >
> >16/3
> > An Ada implementation shall accept Ada source code in UTF-8 encoding,
> > with or without a BOM (see A.4.11), where every character is represented
> > by its code point. The character pair CARRIAGE RETURN/LINE FEED (code
> > points 16#0D# 16#0A#) signifies a single end of line (see 2.2); every
> > other occurrence of a format_effector other than the character whose
> > code point position is 16#09# (CHARACTER TABULATION) also signifies a
> > single end of line.
> 
> Two points here:
> 
> (1) The Ada Standard requires no other encoding. The expectation is that in 
> the long term, all Ada (portable) source code will be encoded in UTF-8. 
> There's no requirement for a compiler to support anything else, and the only 
> need beyond that is to process legacy code -- a tool similar to GNATChop 
> could handle that without messing up the compiler. (Note that the ACATS is 
> provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a 
> subset of the latter.)
> 
> (2) This is *only* about the source encoding. It has no effect on anything 
> beyond the lexical level of an Ada program. In particular, it has no effect 
> on any runtime behavior. Indeed, source encoding is so different than 
> anything specified in the Ada Standard that in previous versions of Ada, it 
> wasn't specified at all. Source encoding, other than the UTF-8 encoding 
> defined in the Standard, is inherently implementation-defined, because the 
> interpretation of the encoding has to happen before any Ada rules can be 
> applied (from lexical and syntax rules on down).
> 
> >It feels like we should be able to explicitly define the encoding for a 
> >source via pragma:
> >
> >    pragma Character_Set ("UTF-8");
> 
> This is clearly pointless:
> (1) As noted above, the only required source encoding is UTF-8. If you need 
> portable code, there is no other choice, and if you don't, you don't need a 
> portable way to specify it.
> (2) It should be obvious that a pragma is too late. Since such a pragma is 
> inside of the source code, and encoded using whatever encoding, by the time 
> the compiler recognizes it, it has already been assuming an encoding. And if 
> it assumed wrong, it probably couldn't recognize it at all (consider source 
> code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the 
> compiler already knows. And since it has to be optional (obviously, no 
> existing Ada source code has such a pragma), the absence of it doesn't tell 
> the compiler anything, either.
> 
> So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B) 
> complain to your vendor if the encoding does anything other than determine 
> the source code encoding.
> 
>                                     Randy.
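
For comparison, a sketch that produces the UTF-8 bytes at run time with
the standard Ada.Strings.UTF_Encoding packages (A.4.11), so the result
does not depend on how the compiler reads the source file.  The source
here is plain ASCII, and the procedure name Pi_Encode is made up for
illustration:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Pi_Encode is
   use Ada.Strings.UTF_Encoding;

   --  U+03C0 given by code point position, so no non-ASCII
   --  character appears in the source text.
   Pi : constant Wide_String := (1 => Wide_Character'Val (16#03C0#));

   --  Encode to UTF-8 explicitly; Output_BOM defaults to False.
   Encoded : constant UTF_8_String := Wide_Strings.Encode (Pi);
begin
   for C of Encoded loop
      --  Print each encoded byte in decimal.
      Ada.Text_IO.Put_Line (Integer'Image (Character'Pos (C)));
   end loop;
end Pi_Encode;
```

This should print 207 and 128, i.e. the same 16#CF# 16#80# pair that
the PiDay example spells out by hand with Character'Val.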
