Re: UTF-8 Output and "-gnatW8"

comp.lang.ada

From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 17:09:31 -0500
Message-ID: <nd1oir$c8r$1@loke.gir.dk>
In-Reply-To: 35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com

"Michael Rohan" <michael@zanyblue.com> wrote in message 
news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
...
>I've been using this option to state that my source files are UTF-8 encoded,
>but I don't particularly want to change the behaviour of the Ada.Text_IO
>routines.

I don't see any reason that the character encoding option ought to change 
the runtime behavior of anything - it ought to just tell the compiler about 
the form of the source code. But I'm definitely not an expert in GNAT.

>  I don't see an option that covers just the source file encoding without 
> impacting the Text_IO (narrow) functionality.

I don't see anything in the documentation you posted that says it has any 
effect on Text_IO, nor would I expect it to, since it says it controls the 
representation of Wide_Characters, and there are no wide characters 
associated with Text_IO.

>It's pretty easy to see this.  Here's an already UTF-8 encoded string 
>example:
>
>with Ada.Text_IO;
>procedure PiDay is
>begin
>   Ada.Text_IO.Put_Line (
>      "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
>end PiDay;

Since this program text doesn't include any wide characters, there should be 
no effect on the behavior of Text_IO.

I think what you are seeing is just a bug; I'd suggest reporting it as a bug to 
AdaCore and see what they say. (Even if they intended something to happen 
here, it seems to be a horribly bad idea.) My guess is that they are folding 
the string literal and then encoding that into UTF-8, even though such 
encoding is too late.
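
To make that guess concrete, here is a minimal sketch of the suspected double 
encoding. Python is used only to show the bytes; this is not GNAT's actual 
code, and the function name reencode_as_latin1 is hypothetical. The assumption 
is that each 8-bit Character of the folded literal is treated as a Latin-1 
code point and then written out again as UTF-8.

```python
# Hypothetical illustration of the suspected bug, not GNAT's actual code.
def reencode_as_latin1(raw: bytes) -> bytes:
    """Treat each byte as a Latin-1 character and encode the result as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

# The two Character'Val values from the quoted program: UTF-8 for U+03C0 (pi).
pi_utf8 = bytes([0xCF, 0x80])
double_encoded = reencode_as_latin1(pi_utf8)

print(pi_utf8.hex(" "))         # cf 80
print(double_encoded.hex(" "))  # c3 8f c2 80
```

If the program's actual output contains the four bytes from the second line 
rather than the two from the first, that would be consistent with this guess.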

>The RM includes an "Implementation Requirement":
>
>16/3
> An Ada implementation shall accept Ada source code in UTF-8 encoding, with
> or without a BOM (see A.4.11), where every character is represented by its
> code point. The character pair CARRIAGE RETURN/LINE FEED (code points
> 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other
> occurrence of a format_effector other than the character whose code point
> position is 16#09# (CHARACTER TABULATION) also signifies a single end of
> line.

Two points here:

(1) The Ada Standard requires no other encoding. The expectation is that in 
the long term, all Ada (portable) source code will be encoded in UTF-8. 
There's no requirement for a compiler to support anything else, and the only 
need beyond that is to process legacy code -- a tool similar to GNATChop 
could handle that without messing up the compiler. (Note that the ACATS is 
provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a 
subset of the latter.)

(2) This is *only* about the source encoding. It has no effect on anything 
beyond the lexical level of an Ada program. In particular, it has no effect 
on any runtime behavior. Indeed, source encoding is so different than 
anything specified in the Ada Standard that in previous versions of Ada, it 
wasn't specified at all. Source encoding, other than the UTF-8 encoding 
defined in the Standard, is inherently implementation-defined, because the 
interpretation of the encoding has to happen before any Ada rules can be 
applied (from lexical and syntax rules on down).

>It feels like we should be able to explicitly define the encoding for a 
>source via pragma:
>
>    pragma Character_Set ("UTF-8");

This is clearly pointless:
(1) As noted above, the only required source encoding is UTF-8. If you need 
portable code, there is no other choice, and if you don't, you don't need a 
portable way to specify it.
(2) It should be obvious that a pragma is too late. Since such a pragma is 
inside of the source code, and encoded using whatever encoding, by the time 
the compiler recognizes it, it has already been assuming an encoding. And if 
it assumed wrong, it probably couldn't recognize it at all (consider source 
code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the 
compiler already knows. And since it has to be optional (obviously, no 
existing Ada source code has such a pragma), the absence of it doesn't tell 
the compiler anything, either.
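
Point (2) can be illustrated with a small sketch. Python is used only to show 
the bytes, and the helper name finds_pragma is hypothetical. It assumes a 
scanner that looks for the ASCII/UTF-8 bytes of the keyword "pragma", and it 
uses code page 037 as a stand-in for EBCDIC.

```python
# Hypothetical scanner: it already assumes an ASCII-compatible encoding,
# which is exactly the question the pragma was supposed to settle.
LINE = 'pragma Character_Set ("UTF-8");'

def finds_pragma(source_bytes: bytes) -> bool:
    """Return True if the ASCII bytes of 'pragma' occur in the raw source."""
    return b"pragma" in source_bytes

for encoding in ("utf-8", "utf-16-le", "cp037"):
    # Encode the same line three ways and try to spot the keyword.
    print(encoding, finds_pragma(LINE.encode(encoding)))
```

Only the UTF-8 case finds the keyword; in the other two, the scanner would 
need to know the encoding before it could read the line that declares it.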

So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B) 
complain to your vendor if the encoding option does anything other than 
determine the source code encoding.

                                    Randy.
