From mboxrd@z Thu Jan  1 00:00:00 1970
X-Google-Language: ENGLISH,ASCII
X-Google-Thread: 103376,1086bab45b40d4b0
X-Google-Attributes: gid103376,public
Path: 
 controlnews3.google.com!news1.google.com!news.glorb.com!cyclone1.gnilink.net!gnilink.net!wn13feed!worldnet.att.net!bgtnsc05-news.ops.worldnet.att.net.POSTED!not-for-mail
From: David Starner <dvdeug@email.ro>
Subject: Re: UTF-8 in strings - a bug?
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))
Message-Id: <pan.2004.05.06.08.51.44.233412@email.ro>
Newsgroups: comp.lang.ada
References: <TEdmc.58085$mU6.237063@newsb.telia.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Date: Thu, 06 May 2004 09:06:58 GMT
NNTP-Posting-Host: 12.72.68.14
X-Complaints-To: abuse@worldnet.att.net
X-Trace: bgtnsc05-news.ops.worldnet.att.net 1083834418 12.72.68.14 (Thu,
 06 May 2004 09:06:58 GMT)
NNTP-Posting-Date: Thu, 06 May 2004 09:06:58 GMT
Organization: AT&T Worldnet
Xref: controlnews3.google.com comp.lang.ada:304
Date: 2004-05-06T09:06:58+00:00
List-Id: <comp.lang.ada>

On Wed, 05 May 2004 22:12:03 +0000, Björn Persson wrote:

> It seems clear to me: Strings are Latin-1 (except for programs compiled
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the
> strings I get from Ada.Command_Line.Argument contain UTF-8. 

Strings you get from anywhere, not just Ada.Command_Line.Argument, contain
UTF-8. Try setting up an I/O loop with Ada.Text_IO and watch it pass
Cyrillic right through despite being stored as Strings. Ada.Text_IO
doesn't recode on output, so if you do put Latin-1 characters in a
String, they go out as raw Latin-1 bytes and a UTF-8 terminal won't show
them acceptably. I suspect that all current Ada
compilers on Unix systems work this way.
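
A minimal sketch of such an I/O loop, in case anyone wants to try it. The
procedure name Echo and the 1024-element buffer are arbitrary choices of
mine, and I'm assuming a runtime where Ada.Text_IO treats each Character
as one byte, as described above. It copies standard input to standard
output line by line and never looks at what the bytes mean:

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Echo is
   Buffer : String (1 .. 1024);  --  arbitrary size, one Character per byte
   Last   : Natural;
begin
   while not End_Of_File loop
      Get_Line (Buffer, Last);
      --  Write back exactly the Characters that were read; no recoding.
      Put (Buffer (1 .. Last));
      --  A full buffer means the line may continue in the next read,
      --  so only end the output line when the input line really ended.
      if Last < Buffer'Last then
         New_Line;
      end if;
   end loop;
end Echo;
```

Feed it UTF-8 text (Cyrillic, say) on a UTF-8 terminal and the text should
come out unchanged, even though every multi-byte character was held as
several Character values along the way.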

> This means
> that some of the elements in the string aren't characters, only byte
> values that are parts of multi-byte characters. And of course 'Length
> returns the number of bytes, not the number of characters. This looks
> like a violation of the standard. Should I consider this a bug in the
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

I've always considered it a "feature". It may be suboptimal, but it's not
easy to fix and fixing it brings a large number of problems with it.
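
To make the 'Length point concrete, here is a small sketch. The bytes are
spelled out with Character'Val so the result doesn't depend on the
encoding of the source file, and Code_Points is a hypothetical helper of
mine, not anything from the standard library. It just skips UTF-8
continuation bytes and does no validation:

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Byte_Length is
   --  "Bjoern" with o-umlaut (U+00F6) encoded as UTF-8: 16#C3# 16#B6#.
   Sample : constant String :=
     "Bj" & Character'Val (16#C3#) & Character'Val (16#B6#) & "rn";

   --  Hypothetical helper: count UTF-8 code points by skipping
   --  continuation bytes (bit pattern 10xxxxxx, i.e. 16#80# .. 16#BF#).
   function Code_Points (S : String) return Natural is
      Count : Natural := 0;
   begin
      for I in S'Range loop
         if Character'Pos (S (I)) not in 16#80# .. 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Code_Points;
begin
   Put_Line ("'Length     =" & Natural'Image (Sample'Length));
   Put_Line ("code points =" & Natural'Image (Code_Points (Sample)));
end Byte_Length;
```

This reports a 'Length of 6 and 5 code points: the String holds six
Character values, two of which together make up one character.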

Consider this: what happens when the command line contains a filename?
That filename may not be in UTF-8; in fact, under Unix, filenames are
merely byte strings that don't contain 16#00# or 16#2F# (NUL and '/'). (On
my UTF-8 system, I've had several filenames copied from other systems in
Latin-1.) Even if you do recode the name, and leave such files aside, you
must turn it back to its original encoding before passing it to any
system function. Any string that may have to be matched exactly has this
problem; far from all of my text files are UTF-8, but I may want to search
for a byte string in one of them.
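
As a rough illustration of why blind recoding is risky, the sketch below
classifies each command-line argument without altering it.
Looks_Like_UTF_8 is a hypothetical helper of mine and only a structural
check (lead bytes and continuation bytes); it does not reject every
ill-formed sequence, such as overlong forms or encoded surrogates:

```ada
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Command_Line; use Ada.Command_Line;

procedure Check_Names is
   --  Hypothetical helper: True if S is structurally plausible UTF-8.
   function Looks_Like_UTF_8 (S : String) return Boolean is
      I      : Integer := S'First;
      B      : Natural;
      Follow : Natural;  --  continuation bytes expected after the lead byte
   begin
      while I <= S'Last loop
         B := Character'Pos (S (I));
         if B < 16#80# then
            Follow := 0;                      --  plain ASCII
         elsif B in 16#C2# .. 16#DF# then
            Follow := 1;
         elsif B in 16#E0# .. 16#EF# then
            Follow := 2;
         elsif B in 16#F0# .. 16#F4# then
            Follow := 3;
         else
            return False;                     --  not a valid lead byte
         end if;
         if I + Follow > S'Last then
            return False;                     --  sequence cut short
         end if;
         for J in I + 1 .. I + Follow loop
            if Character'Pos (S (J)) not in 16#80# .. 16#BF# then
               return False;                  --  missing continuation byte
            end if;
         end loop;
         I := I + Follow + 1;
      end loop;
      return True;
   end Looks_Like_UTF_8;
begin
   for N in 1 .. Argument_Count loop
      --  The argument itself is passed along byte for byte either way.
      if Looks_Like_UTF_8 (Argument (N)) then
         Put_Line ("UTF-8 or ASCII: " & Argument (N));
      else
         Put_Line ("other bytes:    " & Argument (N));
      end if;
   end loop;
end Check_Names;
```

A Latin-1 name such as "caf" followed by 16#E9# fails the check, because
16#E9# announces a three-byte sequence that never arrives. A library that
insisted on decoding every argument as UTF-8 would have to either reject
that name or change it, and a changed name no longer opens the file.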

Whether or not this is right depends on where Ada should be on the C-Java
spectrum. (Yes, I realize some will complain, but in geometry I may draw a
line through any two points and define that as an axis.) C doesn't worry
about it at all; that makes text harder to deal with, but it keeps things
simple when viewing the world as a stream of bytes and interfacing with
Unix filenames and files.
Java gets it "right" in converting everything to Unicode, but this is
often inefficient, and can make it harder to deal with the real world. I
doubt that Ada is going to get it "right" without too many incompatible
changes, so I'd rather it stayed "C-ish" here.