From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!mx02.eternal-september.org!feeder.eternal-september.org!weretis.net!feeder4.news.weretis.net!gandalf.srv.welterde.de!news.jacob-sparre.dk!loke.jacob-sparre.dk!pnx.dk!.POSTED!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Mon, 4 Apr 2016 19:39:51 -0500
Organization: JSA Research & Innovation
Message-ID: <ndv1go$orl$1@loke.gir.dk>
References: <35689862-61dc-4186-87d3-37b17abed5a2@googlegroups.com>
 <3a65e71c-41ee-49eb-916d-c0be8be9abc6@googlegroups.com>
 <nd42v3$fkq$1@loke.gir.dk>
 <6406289c-06a8-46d1-a633-8a1c8a22f79b@googlegroups.com>
 <ndevv9$smu$1@loke.gir.dk> <ndtgqj$tah$1@dont-email.me>
NNTP-Posting-Host: rrsoftware.com
X-Trace: loke.gir.dk 1459816792 25461 24.196.82.226 (5 Apr 2016 00:39:52 GMT)
X-Complaints-To: news@jacob-sparre.dk
NNTP-Posting-Date: Tue, 5 Apr 2016 00:39:52 +0000 (UTC)
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
X-RFC2646: Format=Flowed; Response
Xref: news.eternal-september.org comp.lang.ada:29973
Date: 2016-04-04T19:39:51-05:00
List-Id: <comp.lang.ada>

"G.B." <bauhaus@futureapps.invalid> wrote in message 
news:ndtgqj$tah$1@dont-email.me...
> On 30.03.16 00:35, Randy Brukardt wrote:
...
> However -I'm guessing- there is embarrassment lurking behind
> handling non-ASCII strings:
> it mostly hinges on the pampered, old misunderstanding that char has
> eight bits, 7 of which are to be used, and each is fixed to represent
> one ASCII character.

Not really. The problem stems from Ada 95 using Latin-1 as the primary 
character set; most Ada 95 compilers accept Latin-1 source code where all 
8 bits are used. There is a lot of such source code in the wild.

UTF-8 represents Latin-1 characters above position 127 as two bytes (as 
opposed to one in Latin-1). There is no automatic way to tell these 
representations apart, as any legal UTF-8 representation of Latin-1 
characters also has a (different) meaning if read as Latin-1.
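
To make that concrete, here is a small sketch in Python (not Ada, and not 
what any compiler actually does; the variable names are purely 
illustrative). It encodes one Latin-1 character both ways and then reads the 
UTF-8 bytes back under each assumption:

```python
# Illustrative only: one character, two encodings, two readings.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, position 233 in Latin-1

utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # one byte:  E9

# The UTF-8 byte sequence is also a perfectly legal Latin-1 byte sequence,
# but it then denotes two different characters (U+00C3 followed by U+00A9).
as_utf8 = utf8_bytes.decode("utf-8")
as_latin1 = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(), latin1_bytes.hex())
print(repr(as_utf8), repr(as_latin1))
```

Both decodes succeed without error, so nothing in the bytes themselves says 
which reading the author intended.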

Thus, a compiler that needs to accept both formats (like GNAT) needs to be 
told which format it is. Most compilers have a default format (probably 
Latin-1 from Ada 95), and changing that default would break a lot of 
customers' existing compilation scripts. So no vendor would do that; after 
all, it's easier to keep an existing customer than to get a new one. There's 
no benefit to pissing them off.

A brand-new built-from-scratch compiler would almost certainly default to 
UTF-8 (that being the Ada 2012 default format).

> Hence, trying to handle more than that in
> any tool, including a compiler reading a source unit, is
> deemed equivalent to tackling a hard problem of number theory.

When you have two identical byte streams that the user intends to mean 
different things, it is clearly impossible for any tool to differentiate 
them. Something external has to describe the format. (It would be nice if 
commonly used OSes included this information, but they don't.)

And what the heck do numeric literals have to do with this anyway? The OP 
was having problems with string literals. (Ada doesn't allow any non-ASCII 
characters in numeric literals anyway; it's the identifiers, string 
literals, and comments that cause the issue.)

                               Randy.