From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Mon, 4 Apr 2016 19:39:51 -0500
Message-ID: <ndv1go$orl$1@loke.gir.dk>
In-Reply-To: ndtgqj$tah$1@dont-email.me
"G.B." <bauhaus@futureapps.invalid> wrote in message
news:ndtgqj$tah$1@dont-email.me...
> On 30.03.16 00:35, Randy Brukardt wrote:
...
> However -I'm guessing- there is embarrassment lurking behind
> handling non-ASCII strings:
> it mostly hinges on the pampered, old misunderstanding that char has
> eight bits, 7 of which are to be used, and each is fixed to represent
> one ASCII character.
Not really. The problem stems from Ada 95 using Latin-1 as the primary
character set; most Ada 95 compilers accept Latin-1 source code where all
8 bits are used. There is a lot of such source code in the wild.
UTF-8 represents Latin-1 characters above position 127 as two bytes (as
opposed to one). There is no possible automatic way to tell these
representations apart, as any legal UTF-8 representation of Latin-1
characters also has a (different) meaning if read as Latin-1.
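A small illustration of that ambiguity, as a Python sketch (the sample
character is an arbitrary choice for demonstration, not something from the
OP's program):

```python
# Arbitrary sample: U+00E9 (e with acute accent), Latin-1 position 233.
text = "\u00e9"

latin1_bytes = text.encode("latin-1")  # one byte:  E9
utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9

# Read as Latin-1, the UTF-8 bytes are a perfectly legal two-character text:
# U+00C3 (A with tilde) followed by U+00A9 (copyright sign).
misread = utf8_bytes.decode("latin-1")

# Conversely, an author who really meant those two characters in a Latin-1
# file produces exactly the same byte stream as the UTF-8 form of U+00E9.
intended_latin1 = "\u00c3\u00a9".encode("latin-1")
same_stream = intended_latin1 == utf8_bytes

# ascii() keeps the output printable on any console.
print(latin1_bytes.hex(), utf8_bytes.hex(), ascii(misread), same_stream)
```

Both readings of the bytes C3 A9 are legal, so nothing in the bytes
themselves says which one the author intended.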
Thus, a compiler that needs to accept both formats (like GNAT) needs to be
told which format it is given. Most compilers have a default format (probably
Latin-1 from Ada 95), and changing that default would break a lot of
customers' existing compilation scripts. So no vendor would do that; after
all, it's easier to keep an existing customer than to get a new one. There's
no benefit to pissing them off.
A brand-new built-from-scratch compiler would almost certainly default to
UTF-8 (that being the Ada 2012 default format).
> Hence, trying to handle more than that in
> any tool, including a compiler reading a source unit, is
> deemed equivalent to tackling a hard problem of number theory.
When you have two identical byte streams that the user intends to mean
different things, it is clearly impossible for any tool to differentiate
them. Something external has to describe the format. (It would be nice if
commonly used OSes included this information, but they don't.)
And what the heck do numeric literals have to do with this anyway? The OP
was having problems with string literals. (Ada doesn't allow any non-ASCII
characters in numeric literals anyway; it's the identifiers, string
literals, and comments that cause the issue.)
Randy.