From mboxrd@z Thu Jan  1 00:00:00 1970
X-Received: by 10.66.102.8 with SMTP id fk8mr6898781pab.44.1458858873763;
        Thu, 24 Mar 2016 15:34:33 -0700 (PDT)
X-Received: by 10.182.125.37 with SMTP id mn5mr109371obb.10.1458858873603;
 Thu, 24 Mar 2016 15:34:33 -0700 (PDT)
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!news.glorb.com!av4no1900482igc.0!news-out.google.com!u9ni4733igk.0!nntp.google.com!nt3no4264956igb.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Thu, 24 Mar 2016 15:34:33 -0700 (PDT)
In-Reply-To: <nd1oir$c8r$1@loke.gir.dk>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=208.91.1.34;
 posting-account=1YPeQwoAAACAk-xhKPD32B0GIDdsFFtk
NNTP-Posting-Host: 208.91.1.34
References: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>
 <nd1oir$c8r$1@loke.gir.dk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com>
Subject: Re: UTF-8 Output and "-gnatW8"
From: Michael Rohan <michael@zanyblue.com>
Injection-Date: Thu, 24 Mar 2016 22:34:33 +0000
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Xref: news.eternal-september.org comp.lang.ada:29883
Date: 2016-03-24T15:34:33-07:00
List-Id: <comp.lang.ada>

Hi,

OK, so this might be a compiler bug.  The RM states the character set should
be ISO 10646, so EBCDIC would seem to be something that is not allowed.

The implementation for GNAT impacts the handling of strings, e.g.,

S : constant Wide_String := "π";

With "-gnatW8" this is correctly interpreted as a string of length 1
containing the character U+03C0.  Without the "-gnatW8" option, GNAT
interprets it as a string of Characters to convert to a Wide_String,
i.e., the two characters U+00CF and U+0080.
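
To make the two readings concrete outside of GNAT, here is a minimal
Python sketch (not Ada, and not a model of the compiler itself).  It
assumes the source file is UTF-8, so the literal holds the two bytes
CF 80; the names raw, as_utf8, as_latin1 and code_points are just
illustrative.

```python
# The two bytes a UTF-8 encoded source file holds between the quotes.
raw = bytes([0xCF, 0x80])

# Reading 1: treat the bytes as UTF-8 (analogous to compiling with -gnatW8).
as_utf8 = raw.decode("utf-8")

# Reading 2: treat each byte as one Latin-1 Character, then widen it
# (analogous to compiling without -gnatW8).
as_latin1 = raw.decode("latin-1")


def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return ["U+%04X" % ord(c) for c in s]


print(len(as_utf8), code_points(as_utf8))      # 1 ['U+03C0']
print(len(as_latin1), code_points(as_latin1))  # 2 ['U+00CF', 'U+0080']
```

The same two bytes give a string of length 1 or length 2 depending only
on which encoding the reader assumes.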

Is the constant string value ambiguous here?

Take care,
Michael.

On Thursday, March 24, 2016 at 3:09:33 PM UTC-7, Randy Brukardt wrote:
> "Michael Rohan" <michael@zanyblue.com> wrote in message
> news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
> ...
> >I've been using this option to state that my source files are UTF-8
> >encoded, but I don't particularly want to change the behaviour of the
> >Ada.Text_IO routines.
>
> I don't see any reason that the character encoding option ought to change
> the runtime behavior of anything - it ought to just tell the compiler about
> the form of the source code. But I'm definitely not an expert in GNAT.
>
> > I don't see an option that covers just the source file encoding without
> > impacting the Text_IO (narrow) functionality.
>
> I don't see anything in the documentation you posted that says it has any
> effect on Text_IO, nor would I expect it to, since it says it controls the
> representation of Wide_Characters, and there are no wide characters
> associated with Text_IO.
>
> >It's pretty easy to see this.  Here's an already UTF-8 encoded string
> >example:
> >
> >with Ada.Text_IO;
> >procedure PiDay is
> >begin
> >   Ada.Text_IO.Put_Line (
> >      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
> >end PiDay;
>
> Since this program text doesn't include any wide characters, there should be
> no effect on the behavior of Text_IO.
>
> I think what you are seeing is just a bug; I'd suggest reporting it as a bug
> to AdaCore and seeing what they say. (Even if they intended something to
> happen here, it seems to be a horribly bad idea.) My guess is that they are
> folding the string literal and then encoding that into UTF-8, even though
> such encoding is too late.
>
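
If that guess is right, the effect on the output bytes would look
something like the following Python sketch.  This only illustrates a
second, late UTF-8 encoding step; I have not confirmed that this is what
GNAT actually does, and the names folded, intended and double_encoded
are just illustrative.

```python
# The string as the PiDay program builds it: one Character per byte.
folded = "It's " + chr(0xCF) + chr(0x80) + " day."

# What the program intends to write: each Character as a single byte.
intended = folded.encode("latin-1")

# Hypothetical late step: the folded literal is encoded to UTF-8 again.
double_encoded = folded.encode("utf-8")

print(intended.hex(" "))        # ... cf 80 ...
print(double_encoded.hex(" "))  # ... c3 8f c2 80 ...
```

The intended bytes are already valid UTF-8 for U+03C0, while the
re-encoded bytes are UTF-8 for the two characters U+00CF and U+0080.
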
> >The RM includes an "Implementation Requirement":
> >
> >16/3
> > An Ada implementation shall accept Ada source code in UTF-8 encoding, with
> > or without a BOM (see A.4.11), where every character is represented by its
> > code point. The character pair CARRIAGE RETURN/LINE FEED (code points
> > 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other
> > occurrence of a format_effector other than the character whose code point
> > position is 16#09# (CHARACTER TABULATION) also signifies a single end of
> > line.
>
> Two points here:
>
> (1) The Ada Standard requires no other encoding. The expectation is that in
> the long term, all Ada (portable) source code will be encoded in UTF-8.
> There's no requirement for a compiler to support anything else, and the only
> need beyond that is to process legacy code -- a tool similar to GNATChop
> could handle that without messing up the compiler. (Note that the ACATS is
> provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a
> subset of the latter.)
>
> (2) This is *only* about the source encoding. It has no effect on anything
> beyond the lexical level of an Ada program. In particular, it has no effect
> on any runtime behavior. Indeed, source encoding is so different from
> anything specified in the Ada Standard that in previous versions of Ada, it
> wasn't specified at all. Source encoding, other than the UTF-8 encoding
> defined in the Standard, is inherently implementation-defined, because the
> interpretation of the encoding has to happen before any Ada rules can be
> applied (from lexical and syntax rules on down).
>
> >It feels like we should be able to explicitly define the encoding for a
> >source via pragma:
> >
> >    pragma Character_Set ("UTF-8");
>
> This is clearly pointless:
> (1) As noted above, the only required source encoding is UTF-8. If you need
> portable code, there is no other choice, and if you don't, you don't need a
> portable way to specify it.
> (2) It should be obvious that a pragma is too late. Since such a pragma is
> inside of the source code, and encoded using whatever encoding, by the time
> the compiler recognizes it, it has already been assuming an encoding. And if
> it assumed wrong, it probably couldn't recognize it at all (consider source
> code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the
> compiler already knows. And since it has to be optional (obviously, no
> existing Ada source code has such a pragma), the absence of it doesn't tell
> the compiler anything, either.
>
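
Point (2) is easy to check at the byte level.  The Python sketch below
is an illustration, not a compiler model.  It encodes the proposed
pragma line three ways and asks whether a scanner that assumes ASCII or
UTF-8 would even find the bytes of the keyword.  Python's "cp500" codec
stands in for EBCDIC, and keyword_visible is a made-up helper.

```python
# The pragma line proposed earlier in the thread.
PRAGMA = 'pragma Character_Set ("UTF-8");'

# What a scanner that assumes ASCII/UTF-8 looks for.
ascii_keyword = b"pragma"


def keyword_visible(encoding):
    """True if the ASCII keyword bytes appear in the encoded pragma line."""
    return ascii_keyword in PRAGMA.encode(encoding)


for enc in ("utf-8", "utf-16-le", "cp500"):
    print(enc, keyword_visible(enc))
```

Only the UTF-8 case finds the keyword, so the pragma can only be read by
a compiler that has already guessed the encoding correctly.
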
> So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B)
> complain to your vendor if the encoding does anything other than determine
> the source code encoding.
>
>                                     Randy.