From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
X-Received: by 10.13.194.67 with SMTP id e64mr5761089ywd.17.1458840232176;
        Thu, 24 Mar 2016 10:23:52 -0700 (PDT)
X-Received: by 10.182.113.198 with SMTP id ja6mr98213obb.0.1458840232130; Thu,
 24 Mar 2016 10:23:52 -0700 (PDT)
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!news.glorb.com!y89no9196103qge.0!news-out.google.com!pn7ni16749igb.0!nntp.google.com!nt3no4185626igb.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Thu, 24 Mar 2016 10:23:51 -0700 (PDT)
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=208.91.2.3;
 posting-account=1YPeQwoAAACAk-xhKPD32B0GIDdsFFtk
NNTP-Posting-Host: 208.91.2.3
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>
Subject: UTF-8 Output and "-gnatW8"
From: Michael Rohan <michael@zanyblue.com>
Injection-Date: Thu, 24 Mar 2016 17:23:52 +0000
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Xref: news.eternal-september.org comp.lang.ada:29880
Date: 2016-03-24T10:23:51-07:00
List-Id: <comp.lang.ada>

Hi Folks,

I'm seeing what I suspect is the GNAT run-time re-encoding an already UTF-8 encoded string when the "-gnatW8" option is used.  The help text for "-gnatW8" states

-gnatW?   Wide character encoding method (?=h/u/s/e/8/b)

I've been using this option to state that my source files are UTF-8 encoded, but I don't particularly want to change the behaviour of the Ada.Text_IO routines.  I don't see an option that covers just the source file encoding without affecting the (narrow) Text_IO functionality.

I'm going to adjust my build process to use "-gnatW8" only when compiling sources that contain non-ASCII, UTF-8 characters.

It's pretty easy to see this.  Here's an example using an already UTF-8 encoded string:

with Ada.Text_IO;
procedure PiDay is
begin
   Ada.Text_IO.Put_Line (
      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
end PiDay;

Building and executing with and without "-gnatW8" gives

$ gnatmake piday
gcc -c piday.adb
gnatbind -x piday.ali
gnatlink piday.ali
$ ./piday
It's π day.
$ touch piday.adb
$ gnatmake -gnatW8 piday
gcc -c -gnatW8 piday.adb
gnatbind -x piday.ali
gnatlink piday.ali
$ ./piday
It's Ï day.
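
The second output is consistent with each byte of the string being treated as a Latin-1 character and encoded to UTF-8 a second time.  As a rough sketch of that suspicion (not a look at what the run-time actually does internally), the standard Ada 2012 package Ada.Strings.UTF_Encoding.Strings can reproduce the same bytes; the procedure name Double_Encode and the hex-dump loop are just for illustration:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Double_Encode is
   Hex : constant String := "0123456789ABCDEF";

   --  The two bytes of the UTF-8 encoding of U+03C0 (GREEK SMALL LETTER PI).
   Pi_Bytes : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);

   --  Treat each byte as a Latin-1 character and encode it as UTF-8 again.
   Re_Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Pi_Bytes);
begin
   --  Dump the result as hex so the terminal encoding does not matter.
   for C of Re_Encoded loop
      Ada.Text_IO.Put (Hex (Character'Pos (C) / 16 + 1));
      Ada.Text_IO.Put (Hex (Character'Pos (C) mod 16 + 1));
      Ada.Text_IO.Put (' ');
   end loop;
   Ada.Text_IO.New_Line;
end Double_Encode;
```

This should print C3 8F C2 80, i.e. U+00CF and U+0080 each encoded as a two-byte UTF-8 sequence, which matches what the "-gnatW8" build writes in place of the original CF 80.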

The RM includes an "Implementation Requirement":

16/3
 An Ada implementation shall accept Ada source code in UTF-8 encoding, with or without a BOM (see A.4.11), where every character is represented by its code point. The character pair CARRIAGE RETURN/LINE FEED (code points 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other occurrence of a format_effector other than the character whose code point position is 16#09# (CHARACTER TABULATION) also signifies a single end of line.

It feels like we should be able to explicitly define the encoding for a source file via a pragma:

    pragma Character_Set ("UTF-8");
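
In the meantime, one possible way to keep "-gnatW8" on every source file and still emit pre-encoded bytes might be to bypass the Text_IO character routines and write through the stream interface instead.  This is only a sketch, and it rests on an assumption not verified here: that the Ada.Text_IO.Text_Streams implementation writes the String elements to the file unchanged, with no wide character encoding applied.  The procedure name PiDay_Stream is made up for the example:

```ada
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;

procedure PiDay_Stream is
   --  Stream view of standard output (standard Ada, A.12.2).
   Output : constant Ada.Text_IO.Text_Streams.Stream_Access :=
     Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);
begin
   --  String'Write emits the characters one stream element at a time,
   --  without array bounds; the line terminator is added by hand.
   String'Write
     (Output,
      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day."
      & ASCII.LF);
end PiDay_Stream;
```

If the assumption holds, building this with and without "-gnatW8" should give the same bytes on standard output.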

Take care,
Michael.