Newsgroups: comp.lang.ada
From: Michael Rohan
Subject: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 10:23:51 -0700 (PDT)
Message-ID: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>

Hi Folks,

I'm seeing what I suspect is the GNAT run-time encoding an already encoded UTF-8 string a second time when the "-gnatW8" option is used. The help info on "-gnatW8" states

    -gnatW?   Wide character encoding method (?=h/u/s/e/8/b)

I've been using this option to state that my source files are UTF-8 encoded, but I don't particularly want to change the behaviour of the Ada.Text_IO routines. I don't see an option that covers just the source file encoding without impacting the Text_IO (narrow) functionality.
I'm going to adjust my build process to only use "-gnatW8" when compiling sources that contain non-ASCII, UTF-8 characters.

It's pretty easy to see this. Here's an example with an already UTF-8 encoded string:

    with Ada.Text_IO;
    procedure PiDay is
    begin
       Ada.Text_IO.Put_Line (
          "It's " &
          Character'Val (16#CF#) &
          Character'Val (16#80#) &
          " day.");
    end PiDay;

Building and executing with and without "-gnatW8" gives

    $ gnatmake piday
    gcc -c piday.adb
    gnatbind -x piday.ali
    gnatlink piday.ali
    $ ./piday
    It's π day.
    $ touch piday.adb
    $ gnatmake -gnatW8 piday
    gcc -c -gnatW8 piday.adb
    gnatbind -x piday.ali
    gnatlink piday.ali
    $ ./piday
    It's Ï<U+0080> day.

That is, the first run writes the bytes CF 80 (π) and the second writes C3 8F C2 80 (Ï followed by the control character U+0080).

The RM includes an "Implementation Requirement":

    16/3 An Ada implementation shall accept Ada source code in UTF-8
    encoding, with or without a BOM (see A.4.11), where every character
    is represented by its code point. The character pair CARRIAGE
    RETURN/LINE FEED (code points 16#0D# 16#0A#) signifies a single end
    of line (see 2.2); every other occurrence of a format_effector other
    than the character whose code point position is 16#09# (CHARACTER
    TABULATION) also signifies a single end of line.

It feels like we should be able to explicitly define the encoding for a source file via a pragma:

    pragma Character_Set ("UTF-8");

Take care,
Michael.
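
For reference, the "-gnatW8" output of the PiDay example is what one would expect if each byte of the already encoded string were treated as a Latin-1 character and encoded to UTF-8 again. The following is a minimal sketch of that arithmetic, assuming an Ada 2012 compiler. The procedure name Double_Encode and its local names are made up for illustration, and this is not a claim about how the GNAT run-time implements its output path.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

--  Hypothetical helper, not part of the original post.
procedure Double_Encode is

   --  The two bytes from the PiDay example: U+03C0 already encoded as UTF-8.
   Pi_Bytes : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);

   --  Treat each byte as a Latin-1 character and encode it as UTF-8 again.
   Twice : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Pi_Bytes);

   Hex : constant String := "0123456789ABCDEF";

begin
   --  Print every resulting byte as two hex digits.
   for C of Twice loop
      Ada.Text_IO.Put (Hex (Character'Pos (C) / 16 + 1));
      Ada.Text_IO.Put (Hex (Character'Pos (C) mod 16 + 1));
      Ada.Text_IO.Put (' ');
   end loop;
   Ada.Text_IO.New_Line;
end Double_Encode;
```

The source contains only ASCII, so it can be built with or without "-gnatW8". It should print the four bytes C3 8F C2 80, matching the second run of piday.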