From: David Starner
Subject: Re: UTF-8 in strings - a bug?
Newsgroups: comp.lang.ada
Date: Thu, 06 May 2004 09:06:58 GMT

On Wed, 05 May 2004 22:12:03 +0000, Björn Persson wrote:

> It seems clear to me: Strings are Latin-1 (except for programs compiled
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the
> strings I get from Ada.Command_Line.Argument contain UTF-8.

Strings you get from anywhere, not just Ada.Command_Line.Argument,
contain UTF-8. Try setting up an i/o loop with Ada.Text_IO and watch it
pass Cyrillic right through despite being stored as Strings. Ada.Text_IO
doesn't recode on output, so if you do include Latin-1 characters in a
string, you won't get acceptable output. I suspect that all current Ada
compilers on Unix systems work this way.

> This means
> that some of the elements in the string aren't characters, only byte
> values that are parts of multi-byte characters.
> And of course 'Length
> returns the number of bytes, not the number of characters. This looks
> like a violation of the standard. Should I consider this a bug in the
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

I've always considered it a "feature". It may be suboptimal, but it's
not easy to fix, and fixing it brings a large number of problems with it.

Consider this: what happens when the command line contains a filename?
That filename may not be in UTF-8; in fact, under Unix, filenames are
merely byte strings that contain neither 16#00# nor 16#2F#. (On my UTF-8
system, I've had several filenames copied from other systems in Latin-1.)
If you do recode it, and ignore files like those, you must still turn it
back to its original encoding before passing the name to any system
function. Any string that may have to be matched exactly can have this
problem; far from all of my text files are UTF-8, but I may want to
search for a byte string in one of them.

Whether or not this is right depends on where Ada should be on the
C-Java spectrum. (Yes, I realize some will complain, but in geometry I
may draw a line through any two points and define that as an axis.) C
doesn't worry about encodings at all; that makes text harder to deal
with, but it simplifies viewing the world as a stream of bytes and
interfacing with Unix filenames and files. Java gets it "right" in
converting everything to Unicode, but this is often inefficient, and can
make it harder to deal with the real world. I doubt that Ada is going to
get it "right" without too many incompatible changes, so I'd rather it
stayed "C-ish" here.
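
To make the byte-versus-character point concrete, here is a minimal
sketch. It is written in Python rather than Ada, purely as an
illustration: decoding bytes as Latin-1 stands in for an Ada String of
8-bit Characters, and the helper names (as_latin1_string, recode_or_none)
and the sample name are hypothetical, not part of any Ada or GNAT API.

```python
# Illustration only: Python stands in for Ada here. Decoding bytes as
# Latin-1 mimics a String whose elements are 8-bit Characters.

def as_latin1_string(raw: bytes) -> str:
    """View raw bytes as a Latin-1 string: one element per byte."""
    return raw.decode("latin-1")


def recode_or_none(raw: bytes):
    """Try to recode bytes as UTF-8 text; return None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The same name as a UTF-8 locale would deliver it on the command line...
utf8_arg = "Bj\u00f6rn".encode("utf-8")
# ...and as a filename copied over from a Latin-1 system.
latin1_name = "Bj\u00f6rn".encode("latin-1")

# Element count of the "C-ish" view is a byte count, like 'Length.
print(len(as_latin1_string(utf8_arg)))    # 6
# Character count after recoding from UTF-8.
print(len(utf8_arg.decode("utf-8")))      # 5
# The Latin-1 filename cannot be recoded as UTF-8 at all.
print(recode_or_none(latin1_name))        # None
# Passing the bytes through untouched keeps the exact name for system calls.
print(as_latin1_string(latin1_name).encode("latin-1") == latin1_name)  # True
```

The last line is the trade-off in miniature: the byte-transparent view
miscounts characters, but it can always hand the original filename back
to the operating system unchanged.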