From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Mon, 4 Apr 2016 19:39:51 -0500
Organization: JSA Research & Innovation

"G.B." wrote in message news:ndtgqj$tah$1@dont-email.me...
> On 30.03.16 00:35, Randy Brukardt wrote:
...
> However -I'm guessing- there is embarrassment lurking behind
> handling non-ASCII strings:
> it mostly hinges on the pampered, old misunderstanding that char has
> eight bits, 7 of which are to be used, and each is fixed to represent
> one ASCII character.

Not really. The problem stems from Ada 95 using Latin-1 as the primary character set; most Ada 95 compilers accept Latin-1 source code where all 8 bits are used. There is a lot of such source code in the wild.

UTF-8 represents Latin-1 characters above position 127 as two bytes (as opposed to one in Latin-1). There is no automatic way to tell these representations apart, as any legal UTF-8 representation of Latin-1 characters also has a (different) meaning if read as Latin-1. Thus, a compiler that needs to accept both formats (like GNAT) needs to be told which format it is getting.

Most compilers have a default format (probably Latin-1 from Ada 95), and changing that default would break a lot of customers' existing compilation scripts. So no vendor would do that; after all, it's easier to keep an existing customer than to get a new one. There's no benefit to pissing them off.

A brand-new built-from-scratch compiler would almost certainly default to UTF-8 (that being the Ada 2012 default format).

> Hence, trying to handle more than that in
> any tool, including a compiler reading a source unit, is
> deemed equivalent to tackling a hard problem of number theory.

When you have two identical byte streams that the user intends to mean different things, it is clearly impossible for any tool to differentiate them. Something external has to describe the format. (It would be nice if commonly used OSes included this information, but they don't.)

And what the heck do numeric literals have to do with this anyway? The OP was having problems with string literals. (Ada doesn't allow any non-ASCII characters in numeric literals anyway; it's the identifiers, string literals, and comments that cause the issue.)

Randy.
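
The ambiguity described above can be seen in a few lines. This is only a minimal sketch, written in Python rather than Ada purely for brevity; the sample string and the `decode_source` helper are hypothetical names made up for the illustration and are not part of any compiler. It shows that the UTF-8 bytes for a Latin-1 character are also legal Latin-1, just with a different meaning, so the encoding has to be supplied from outside the byte stream.

```python
# Illustration only: the sample text and the helper name are assumptions
# made up for this sketch, not taken from any compiler or from the post.

text = "caf\u00e9"  # "cafe" with an acute accent on the e (Latin-1 position 233)

latin1_bytes = text.encode("latin-1")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")      # two bytes for the accented letter

# The UTF-8 byte sequence is also a legal Latin-1 byte sequence;
# read that way, it simply spells a different (unintended) string.
misread = utf8_bytes.decode("latin-1")


def decode_source(raw: bytes, encoding: str) -> str:
    """Decode source bytes using an encoding named outside the file,
    much as a command-line switch names it for a compiler."""
    return raw.decode(encoding)


if __name__ == "__main__":
    # ascii() keeps the output readable on any console encoding.
    print(len(latin1_bytes), len(utf8_bytes))
    print(ascii(decode_source(utf8_bytes, "utf-8")))
    print(ascii(decode_source(utf8_bytes, "latin-1")))
```

The same bytes decode without error under either setting, and only the caller's choice of `encoding` decides which string comes out, which is the role the external format description plays in the post.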