From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 Output and "-gnatW8"
Date: Fri, 25 Mar 2016 14:15:00 -0500
Organization: JSA Research & Innovation

"Michael Rohan" wrote in message
news:4f157cc0-1d3c-46d7-ab19-88a13ae0afd0@googlegroups.com...
>Hi,
>
>OK, so this might be a compiler bug. The RM states the character set
>should be ISO 10646 so EBCDIC would seem to be something that is not allowed.

Ah, that's a common mistake. The RM specifies what the *runtime* character set is. Prior to Ada 2012, it said *nothing* about the encoding of Ada source code, and even now, it only talks about UTF-8 as one possibility for that encoding. Anything else is allowed, including EBCDIC, Shift-JIS, or even some sort of tree (the latter is explicitly mentioned as a possibility in the AARM).
Someone even suggested a source representation where '{' = "begin", '}' = "end", etc. (It was that suggestion that finally got the UTF-8 "standard" encoding into the Standard, to provide real interoperability for Ada source code.)

>The implementation for GNAT impacts the handling of strings, e.g.,
>
>S : constant Wide_String := "π";
>
>With "-gnatW8" this is correctly interpreted as a string of length 1
>containing the character U+03C0. Without the "-gnatW8" option, GNAT
>interprets it as a string of Characters to convert to a Wide_String,
>i.e., the two character U+00CF and U+0080

That seems right to me. (In a new compiler, I'd make UTF-8 the default, but any existing compiler probably would have to make it a switch of some sort.) But that's because you have a UTF-8 character in the source code.

The bug is what you described earlier: source code with no explicit UTF-8 characters (representing them instead as Character'Val(16#C0#) and the like) changing behavior in response to such a switch. That's a bug in my view. Character'Val(16#C0#) isn't a character literal at compile-time, it's a function call, and its representation is the same regardless of whether the source is read as 7-bit ASCII or UTF-8.

>Is the constant string value ambiguous here?

It means something different depending upon the source representation. I believe GNAT is getting that correct. "Character'Val(16#C0#)" means the same thing in either source representation, so you should get the same results for the program containing that. If you don't, that's a bug.

Hope this clears it up.

Randy.
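
The length-1 versus length-2 reading quoted above can be reproduced outside Ada. The following is a minimal Python sketch, not GNAT's actual implementation. It assumes the source file stores the literal as UTF-8 bytes and that, without the switch, each byte is read as one Latin-1 character, which matches the two code points quoted in the thread. The names `source_bytes`, `as_utf8`, `as_latin1`, and `code_points` are made up for the illustration.

```python
# Illustration only: read the same source bytes under two assumed encodings.

PI = "\u03c0"  # GREEK SMALL LETTER PI, the character in the quoted literal

# The bytes a UTF-8 editor would store for the string literal's contents.
source_bytes = PI.encode("utf-8")

# Reading the bytes as UTF-8 (the "-gnatW8" reading): one character.
as_utf8 = source_bytes.decode("utf-8")

# Reading one byte per character as Latin-1 (assumed default reading): two characters.
as_latin1 = source_bytes.decode("latin-1")


def code_points(text):
    """Format each character of text as a U+XXXX code point."""
    return [f"U+{ord(ch):04X}" for ch in text]


# An expression such as Character'Val(16#C0#) consists of ASCII bytes only,
# so both readings produce exactly the same source text.
val_bytes = "Character'Val(16#C0#)".encode("ascii")
same_either_way = val_bytes.decode("utf-8") == val_bytes.decode("latin-1")

print(len(as_utf8), code_points(as_utf8))
print(len(as_latin1), code_points(as_latin1))
print(same_either_way)
```

Only the literal containing a non-ASCII character depends on the assumed source encoding. The all-ASCII expression reads identically either way, which is why a change in its behavior would point to a compiler bug rather than an ambiguity in the source.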