Re: UTF-8 Output and "-gnatW8"

comp.lang.ada

From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Mon, 4 Apr 2016 19:39:51 -0500
Message-ID: <ndv1go$orl$1@loke.gir.dk>
In-Reply-To: ndtgqj$tah$1@dont-email.me

"G.B." <bauhaus@futureapps.invalid> wrote in message 
news:ndtgqj$tah$1@dont-email.me...
> On 30.03.16 00:35, Randy Brukardt wrote:
...
> However -I'm guessing- there is embarrassment lurking behind
> handling non-ASCII strings:
> it mostly hinges on the pampered, old misunderstanding that char has
> eight bits, 7 of which are to be used, and each is fixed to represent
> one ASCII character.

Not really. The problem stems from Ada 95 using Latin-1 as the primary 
character set; most Ada 95 compilers accept Latin-1 source code where all 
8 bits are used. There is a lot of such source code in the wild.

UTF-8 represents Latin-1 characters above position 127 as two bytes (as 
opposed to one). There is no automatic way to tell these representations 
apart, as any legal UTF-8 representation of Latin-1 characters also has a 
(different) meaning if read as Latin-1.
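
To make the ambiguity concrete, here is a minimal sketch in Python (used 
only to show the bytes; the helper name decode_both is made up for this 
illustration). It encodes one Latin-1 character as UTF-8 and then reads the 
resulting bytes back under both assumptions. Both reads succeed, with 
different results:

```python
# U+00E9 (e with acute accent) is Latin-1 position 233, i.e. above 127.
# Written as an escape so this file itself stays pure ASCII.
text = "\u00e9"

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # one byte:  0xE9


def decode_both(data):
    """Hypothetical helper: read the same bytes as UTF-8 and as Latin-1.

    Returns a (utf8_reading, latin1_reading) pair. Latin-1 assigns a
    character to every byte value, so the second decode cannot fail.
    """
    return data.decode("utf-8"), data.decode("latin-1")


as_utf8, as_latin1 = decode_both(utf8_bytes)
# as_utf8   is the single character U+00E9
# as_latin1 is the two characters U+00C3 U+00A9
print(len(as_utf8), len(as_latin1))  # prints "1 2"
```

Nothing in the byte stream itself says which of the two readings the 
author intended.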

Thus, a compiler that needs to accept both formats (like GNAT) needs to be 
told which format it is. Most compilers have a default format (probably 
Latin-1 from Ada 95), and changing that default would break a lot of 
customers' existing compilation scripts. So no vendor would do that; after 
all, it's easier to keep an existing customer than to get a new one. There's 
no benefit to pissing them off.

A brand-new built-from-scratch compiler would almost certainly default to 
UTF-8 (that being the Ada 2012 default format).

> Hence, trying to handle more than that in
> any tool, including a compiler reading a source unit, is
> deemed equivalent to tackling a hard problem of number theory.

When you have two identical byte streams that the user intends to mean 
different things, it is clearly impossible for any tool to differentiate 
them. Something external has to describe the format. (It would be nice if 
commonly used OSes included this information, but they don't.)

And what the heck do numeric literals have to do with this anyway? The OP 
was having problems with string literals. (Ada doesn't allow any non-ASCII 
characters in numeric literals anyway; it's the identifiers, string 
literals, and comments that cause the issue.)

                               Randy.
