From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 in strings - a bug?
Date: Wed, 05 May 2004 19:31:15 -0400

Björn Persson wrote:

> The reference manual says:
>
> 3.5.2(2): The predefined type Character is a character type whose values
> correspond to the 256 code positions of Row 00 (also known as Latin-1)
> of the ISO 10646 Basic Multilingual Plane (BMP).
>
> 3.6.3(4): type String is array(Positive range <>) of Character;
>
> It seems clear to me: Strings are Latin-1 (except for programs compiled
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the
> strings I get from Ada.Command_Line.Argument contain UTF-8. This means
> that some of the elements in the string aren't characters, only byte
> values that are parts of multi-byte characters. And of course 'Length
> returns the number of bytes, not the number of characters. This looks
> like a violation of the standard. Should I consider this a bug in the
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

Hmmmm... The technical answer is that GNAT is not validated on Fedora
with UTF-8. The practical answer is that with GNAT, you should compile
using the UTF-8 non-standard mode, if you are using UTF-8.

But what if you want to validate on Fedora in UTF-8 mode? Then you will
have to modify the libraries to get this "right."

-- 
Robert I. Eachus

"The terrorist enemy holds no territory, defends no population, is
unconstrained by rules of warfare, and respects no law of morality. Such
an enemy cannot be deterred, contained, appeased or negotiated with. It
can only be destroyed--and that, ladies and gentlemen, is the business
at hand." -- Dick Cheney
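
For readers who want to see the byte-versus-character mismatch described in the quoted question, here is a minimal sketch. It is written in Python rather than Ada, purely as an illustration, and it assumes that the environment hands the program raw UTF-8 bytes and that each byte is then treated as one Latin-1 character, which is the behaviour the question describes. The sample name and the variable names are made up for this example; nothing here is GNAT's actual implementation.

```python
# Illustration only: models a Latin-1 string that is handed UTF-8 bytes.
# The sample text and all names below are assumptions for this sketch.

name = "Bj\u00f6rn"  # five characters; the third is o-with-diaeresis

# What a UTF-8 locale would pass on the command line: a byte sequence.
utf8_bytes = name.encode("utf-8")

# Reading each byte as one Latin-1 code position, as a String of
# Character would, turns the two-byte sequence into two separate elements.
as_latin1 = utf8_bytes.decode("latin-1")

char_count = len(name)         # number of characters the user typed
byte_count = len(utf8_bytes)   # number of elements a byte-wise length reports

print(char_count, byte_count, as_latin1)
```

The length of the byte-wise view exceeds the character count by one for every two-byte character, which is why a length attribute applied to such a string counts bytes rather than characters.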