From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 17:09:31 -0500
Organization: JSA Research & Innovation
References: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>

"Michael Rohan" wrote in message
news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
...
>I've been using this option to state that my source files are UTF-8 encoded,
>but I don't particularly want to change the behaviour of the Ada.Text_IO
>routines.

I don't see any reason that the character encoding option ought to change
the runtime behavior of anything - it ought to just tell the compiler about
the form of the source code. But I'm definitely not an expert in GNAT.

> I don't see an option that covers just the source file encoding without
> impacting the Text_IO (narrow) functionality.

I don't see anything in the documentation you posted that says it has any
effect on Text_IO, nor would I expect it to, since it says it controls the
representation of Wide_Characters, and there are no wide characters
associated with Text_IO.

>It's pretty easy to see this. Here's an already UTF-8 encoded string
>example:
>
>with Ada.Text_IO;
>procedure PiDay is
>begin
>   Ada.Text_IO.Put_Line (
>      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
>end PiDay;

Since this program text doesn't include any wide characters, there should be
no effect on the behavior of Text_IO. I think what you are seeing is just a
bug; I'd suggest reporting it as a bug to AdaCore and seeing what they say.
(Even if they intended something to happen here, it seems to be a horribly
bad idea.)

My guess is that they are folding the string literal and then encoding that
into UTF-8, even though such encoding is too late.

>The RM includes an "Implementation Requirement":
>
>16/3
> An Ada implementation shall accept Ada source code in UTF-8 encoding, with
> or without a BOM (see A.4.11), where every character is represented by its
> code point. The character pair CARRIAGE RETURN/LINE FEED (code points
> 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other
> occurrence of a format_effector other than the character whose code point
> position is 16#09# (CHARACTER TABULATION) also signifies a single end of
> line.

Two points here:
(1) The Ada Standard requires no other encoding. The expectation is that in
the long term, all Ada (portable) source code will be encoded in UTF-8.
There's no requirement for a compiler to support anything else, and the only
need beyond that is to process legacy code -- a tool similar to GNATChop
could handle that without messing up the compiler. (Note that the ACATS is
provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a
subset of the latter.)
(2) This is *only* about the source encoding.
It has no effect on anything beyond the lexical level of an Ada program. In
particular, it has no effect on any runtime behavior.

Indeed, source encoding is so different than anything specified in the Ada
Standard that in previous versions of Ada, it wasn't specified at all.
Source encoding, other than the UTF-8 encoding defined in the Standard, is
inherently implementation-defined, because the interpretation of the
encoding has to happen before any Ada rules can be applied (from lexical and
syntax rules on down).

>It feels like we should be able to explicitly define the encoding for a
>source via pragma:
>
>   pragma Character_Set ("UTF-8");

This is clearly pointless:
(1) As noted above, the only required source encoding is UTF-8. If you need
portable code, there is no other choice, and if you don't, you don't need a
portable way to specify it.
(2) It should be obvious that a pragma is too late. Since such a pragma is
inside of the source code, and encoded using whatever encoding, by the time
the compiler recognizes it, it has already been assuming an encoding. And if
it assumed wrong, it probably couldn't recognize the pragma at all (consider
source code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm
what the compiler already knows. And since it has to be optional (obviously,
no existing Ada source code has such a pragma), the absence of it doesn't
tell the compiler anything, either.

So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B)
complain to your vendor if the encoding option does anything other than
determine the source code encoding.

                                  Randy.
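
For illustration only, the guess above (the string literal being folded and
then encoded into UTF-8 a second time) can be modelled outside GNAT. The
Python sketch below is not GNAT's code and says nothing certain about what
the runtime actually does; the helper name reencode_as_latin1 is
hypothetical, and it assumes each byte of the literal is treated as a
Latin-1 character before being encoded again. It only shows what such a
double encoding would do to the two bytes 16#CF# 16#80# from the PiDay
example.

```python
# Bytes of the already UTF-8 encoded pi from the PiDay example (16#CF# 16#80#).
already_utf8 = bytes([0xCF, 0x80])


def reencode_as_latin1(data: bytes) -> bytes:
    """Hypothetical model of the suspected bug (an assumption, not GNAT code):
    treat every byte as a Latin-1 character, then encode the resulting
    characters as UTF-8 a second time."""
    return data.decode("latin-1").encode("utf-8")


double_encoded = reencode_as_latin1(already_utf8)

print(already_utf8.hex(" "))    # cf 80        (valid UTF-8 for U+03C0)
print(double_encoded.hex(" "))  # c3 8f c2 80  (four bytes, no longer pi)
```

If the output of the PiDay program under "-gnatW8" contains the four bytes
C3 8F C2 80 where CF 80 was expected, that would be consistent with this
guess; any other byte sequence would point to a different cause.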