From: Michael Rohan
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 15:34:33 -0700 (PDT)
Message-ID: <4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com>
References: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>

Hi,

OK, so this might be a compiler bug. The RM states the character set should
be ISO 10646, so EBCDIC would seem to be something that is not allowed.

The implementation for GNAT impacts the handling of strings, e.g.,

    S : constant Wide_String := "π";

With "-gnatW8" this is correctly interpreted as a string of length 1
containing the character U+03C0.
Without the "-gnatW8" option, GNAT interprets it as a string of Characters
to convert to a Wide_String, i.e., the two characters U+00CF and U+0080.

Is the constant string value ambiguous here?

Take care,
Michael.

On Thursday, March 24, 2016 at 3:09:33 PM UTC-7, Randy Brukardt wrote:
> "Michael Rohan" wrote in message
> news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
> ...
> > I've been using this option to state that my source files are UTF-8
> > encoded but I don't particularly want to change the behaviour of the
> > Ada.Text_IO routines.
>
> I don't see any reason that the character encoding option ought to change
> the runtime behavior of anything - it ought to just tell the compiler about
> the form of the source code. But I'm definitely not an expert in GNAT.
>
> > I don't see an option that covers just the source file encoding without
> > impacting the Text_IO (narrow) functionality.
>
> I don't see anything in the documentation you posted saying that it has any
> effect on Text_IO, nor would I expect it to, since it says it controls the
> representation of Wide_Characters, and there are no wide characters
> associated with Text_IO.
>
> > It's pretty easy to see this. Here's an already UTF-8 encoded string
> > example:
> >
> > with Ada.Text_IO;
> > procedure PiDay is
> > begin
> >    Ada.Text_IO.Put_Line (
> >       "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
> > end PiDay;
>
> Since this program text doesn't include any wide characters, there should
> be no effect on the behavior of Text_IO.
>
> I think what you are seeing is just a bug; I'd suggest reporting it as a
> bug to AdaCore and seeing what they say. (Even if they intended something
> to happen here, it seems to be a horribly bad idea.) My guess is that they
> are folding the string literal and then encoding that into UTF-8, even
> though such encoding is too late.
>
> > The RM includes an "Implementation Requirement":
> >
> > 16/3
> > An Ada implementation shall accept Ada source code in UTF-8 encoding,
> > with or without a BOM (see A.4.11), where every character is represented
> > by its code point. The character pair CARRIAGE RETURN/LINE FEED (code
> > points 16#0D# 16#0A#) signifies a single end of line (see 2.2); every
> > other occurrence of a format_effector other than the character whose
> > code point position is 16#09# (CHARACTER TABULATION) also signifies a
> > single end of line.
>
> Two points here:
>
> (1) The Ada Standard requires no other encoding. The expectation is that in
> the long term, all Ada (portable) source code will be encoded in UTF-8.
> There's no requirement for a compiler to support anything else, and the only
> need beyond that is to process legacy code -- a tool similar to GNATChop
> could handle that without messing up the compiler. (Note that the ACATS is
> provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a
> subset of the latter.)
>
> (2) This is *only* about the source encoding. It has no effect on anything
> beyond the lexical level of an Ada program. In particular, it has no effect
> on any runtime behavior. Indeed, source encoding is so different from
> anything specified in the Ada Standard that in previous versions of Ada, it
> wasn't specified at all. Source encoding, other than the UTF-8 encoding
> defined in the Standard, is inherently implementation-defined, because the
> interpretation of the encoding has to happen before any Ada rules can be
> applied (from lexical and syntax rules on down).
>
> > It feels like we should be able to explicitly define the encoding for a
> > source via pragma:
> >
> >    pragma Character_Set ("UTF-8");
>
> This is clearly pointless:
> (1) As noted above, the only required source encoding is UTF-8.
> If you need portable code, there is no other choice, and if you don't, you
> don't need a portable way to specify it.
> (2) It should be obvious that a pragma is too late. Since such a pragma is
> inside of the source code, and encoded using whatever encoding, by the time
> the compiler recognizes it, it has already been assuming an encoding. And if
> it assumed wrong, it probably couldn't recognize it at all (consider source
> code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the
> compiler already knows. And since it has to be optional (obviously, no
> existing Ada source code has such a pragma), the absence of it doesn't tell
> the compiler anything, either.
>
> So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B)
> complain to your vendor if the encoding does anything other than determine
> the source code encoding.
>
> Randy.
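
To make the two readings of the Wide_String literal from the top of this
message concrete, here is a minimal sketch. It spells the two bytes
16#CF# 16#80# in plain ASCII, so the result should not depend on the source
encoding or on "-gnatW8". The names Pi_Bytes and Show are made up for
illustration; Decode is from Ada.Strings.UTF_Encoding.Wide_Strings (Ada 2012,
RM A.4.11) and To_Wide_String is from Ada.Characters.Conversions.

```ada
with Ada.Text_IO;
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Pi_Bytes is
   --  The two bytes a UTF-8 source file stores for U+03C0, written with
   --  Character'Val so that no non-ASCII character appears in the source.
   Raw : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);

   --  Reading 1: treat the bytes as one UTF-8 encoded character.
   As_UTF_8 : constant Wide_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Decode (Raw);

   --  Reading 2: treat each byte as a separate Latin-1 Character.
   As_Latin_1 : constant Wide_String :=
     Ada.Characters.Conversions.To_Wide_String (Raw);

   --  Hypothetical helper: print the length and the code point positions
   --  (in decimal) of a Wide_String.
   procedure Show (Label : String; Item : Wide_String) is
   begin
      Ada.Text_IO.Put
        (Label & Natural'Image (Item'Length) & " character(s):");
      for C of Item loop
         Ada.Text_IO.Put (Natural'Image (Wide_Character'Pos (C)));
      end loop;
      Ada.Text_IO.New_Line;
   end Show;
begin
   Show ("UTF-8 reading:", As_UTF_8);
   Show ("Latin-1 reading:", As_Latin_1);
end Pi_Bytes;
```

The UTF-8 reading should give a single Wide_Character at position 960
(U+03C0), and the Latin-1 reading two, at positions 207 and 128 (U+00CF and
U+0080). These are the same two results described above for compiling the
literal with and without "-gnatW8".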