From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Thu, 24 Mar 2016 17:09:31 -0500
Message-ID: <nd1oir$c8r$1@loke.gir.dk>
In-Reply-To: 35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com
"Michael Rohan" <michael@zanyblue.com> wrote in message
news:35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com...
...
>I've been using this option to state that my source files are UTF-8 encoded
>but I don't particularly want to change the behaviour of the Ada.Text_IO
>routines.
I don't see any reason that the character encoding option ought to change
the runtime behavior of anything - it ought to just tell the compiler about
the form of the source code. But I'm definitely not an expert in GNAT.
> I don't see an option that covers just the source file encoding without
> impacting the Text_IO (narrow) functionality.
I don't see anything in the documentation you posted that says it has any
effect on Text_IO, nor would I expect it to, since it says it controls the
representation of Wide_Characters, and there are no wide characters
associated with Text_IO.
>It's pretty easy to see this. Here's an already UTF-8 encoded string
>example:
>
>with Ada.Text_IO;
>procedure PiDay is
>begin
> Ada.Text_IO.Put_Line (
> "It's " & Character'Val (16#CF#) & Character'Val (16#80#) & " day.");
>end PiDay;
Since this program text doesn't include any wide characters, there should be
no effect on the behavior of Text_IO.
I think what you are seeing is just a bug; I'd suggest reporting it as a bug
to AdaCore and seeing what they say. (Even if they intended something to
happen here, it seems to be a horribly bad idea.) My guess is that they are
folding the string literal and then encoding that into UTF-8, even though
such encoding is too late.
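
To make that guess concrete, here is a minimal sketch of what a second
encoding pass would do to the two bytes in the example. It is only an
illustration of the guess, not a description of what GNAT does internally.
The procedure name Double_Encode is hypothetical; the Encode function is the
standard one from Ada.Strings.UTF_Encoding.Strings (A.4.11), which treats
each Character of its argument as a Latin-1 code point.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Double_Encode is
   --  The two bytes that already form the UTF-8 encoding of U+03C0 (pi).
   Pi_Bytes : constant String :=
     Character'Val (16#CF#) & Character'Val (16#80#);

   --  Encode them again, as if each byte were a separate Latin-1 character.
   Twice : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Strings.Encode (Pi_Bytes);

   Hex : constant String := "0123456789ABCDEF";
begin
   --  Print every resulting byte as two hexadecimal digits.
   for C of Twice loop
      Ada.Text_IO.Put
        (Hex (Character'Pos (C) / 16 + 1)
         & Hex (Character'Pos (C) mod 16 + 1)
         & ' ');
   end loop;
   Ada.Text_IO.New_Line;
end Double_Encode;
```

This should print C3 8F C2 80: four bytes instead of the original two, which
a UTF-8 terminal would display as two unrelated characters rather than pi.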
>The RM includes an "Implementation Requirement":
>
>16/3
> An Ada implementation shall accept Ada source code in UTF-8 encoding, with
> or without a BOM (see A.4.11), where every character is represented by its
> code point. The character pair CARRIAGE RETURN/LINE FEED (code points
> 16#0D# 16#0A#) signifies a single end of line (see 2.2); every other
> occurrence of a format_effector other than the character whose code point
> position is 16#09# (CHARACTER TABULATION) also signifies a single end of
> line.
Two points here:
(1) The Ada Standard requires no other encoding. The expectation is that in
the long term, all Ada (portable) source code will be encoded in UTF-8.
There's no requirement for a compiler to support anything else, and the only
need beyond that is to process legacy code -- a tool similar to GNATChop
could handle that without messing up the compiler. (Note that the ACATS is
provided only in 7-bit ASCII and UTF-8 encoded files, and the former is a
subset of the latter.)
(2) This is *only* about the source encoding. It has no effect on anything
beyond the lexical level of an Ada program. In particular, it has no effect
on any runtime behavior. Indeed, source encoding is so different from
anything specified in the Ada Standard that in previous versions of Ada, it
wasn't specified at all. Source encoding, other than the UTF-8 encoding
defined in the Standard, is inherently implementation-defined, because the
interpretation of the encoding has to happen before any Ada rules can be
applied (from lexical and syntax rules on down).
>It feels like we should be able to explicitly define the encoding for a
>source via pragma:
>
> pragma Character_Set ("UTF-8");
This is clearly pointless:
(1) As noted above, the only required source encoding is UTF-8. If you need
portable code, there is no other choice, and if you don't, you don't need a
portable way to specify it.
(2) It should be obvious that a pragma is too late. Since such a pragma is
inside of the source code, and encoded using whatever encoding, by the time
the compiler recognizes it, it has already been assuming an encoding. And if
it assumed wrong, it probably couldn't recognize it at all (consider source
code in EBCDIC or even UCS-2/UTF-16). So at best, it could confirm what the
compiler already knows. And since it has to be optional (obviously, no
existing Ada source code has such a pragma), the absence of it doesn't tell
the compiler anything, either.
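
As a rough sketch of the point in (2): any decision about encoding has to be
made from the raw bytes, before a single token is recognized. The standard
function Ada.Strings.UTF_Encoding.Encoding (A.4.11) does this by looking for
a BOM. The procedure name Sniff_Encoding and the sample byte strings are
hypothetical, standing in for the first bytes of a source file.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Sniff_Encoding is
   --  Hypothetical leading bytes of three source files.
   With_BOM : constant UTF_String := BOM_8 & "pragma";
   No_BOM   : constant UTF_String := "pragma";
   Wide_LE  : constant UTF_String :=
     BOM_16LE & 'p' & Character'Val (0);
begin
   --  Encoding inspects only the leading BOM; with no BOM present it
   --  returns its Default parameter (UTF_8), having learned nothing.
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (With_BOM)));
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (No_BOM)));
   Ada.Text_IO.Put_Line (Encoding_Scheme'Image (Encoding (Wide_LE)));
end Sniff_Encoding;
```

This should print UTF_8, UTF_8 and UTF_16LE. The middle case is the telling
one: with no marker in the bytes, the result is whatever was already assumed,
which is the same position a compiler is in when an optional pragma is
absent.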
So, moral of the story: (A) Use only UTF-8 for portable Ada 2012 code; (B)
complain to your vendor if the encoding does anything other than determine
the source code encoding.
Randy.