Re: UTF-8 in strings - a bug?

comp.lang.ada

From: David Starner <dvdeug@email.ro>
Subject: Re: UTF-8 in strings - a bug?
Date: Thu, 06 May 2004 09:06:58 GMT
Message-ID: <pan.2004.05.06.08.51.44.233412@email.ro>
In-Reply-To: TEdmc.58085$mU6.237063@newsb.telia.net

On Wed, 05 May 2004 22:12:03 +0000, Björn Persson wrote:

> It seems clear to me: Strings are Latin-1 (except for programs compiled
> in nonstandard modes). But when I set my Fedora system to use UTF-8, the
> strings I get from Ada.Command_Line.Argument contain UTF-8. 

Strings you get from anywhere, not just Ada.Command_Line.Argument, contain
UTF-8. Try setting up an I/O loop with Ada.Text_IO and watch it pass
Cyrillic right through despite being stored in Strings. Ada.Text_IO
doesn't recode on output, so if you do include Latin-1 characters in a
string, you won't get acceptable output on a UTF-8 system. I suspect that
all current Ada compilers on Unix systems work this way.
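
A minimal sketch of that pass-through behaviour, written in Python rather
than Ada purely so it is easy to run. The helper names `echo` and
`is_valid_utf8` are hypothetical. The loop copies bytes without recoding,
so UTF-8 Cyrillic survives untouched, while a lone Latin-1 byte comes out
as something that is not valid UTF-8.

```python
import io


def echo(src, dst):
    """Copy lines from src to dst as raw bytes, with no recoding."""
    for line in src:
        dst.write(line)


def is_valid_utf8(raw):
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Cyrillic text as a UTF-8 terminal would deliver it (two bytes per letter).
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442\n".encode("utf-8")
# A Latin-1 e-acute: the single byte 16#E9#, followed by a newline.
latin1 = "\u00e9\n".encode("latin-1")

out = io.BytesIO()
echo(io.BytesIO(cyrillic + latin1), out)
data = out.getvalue()
```

Here `data` is byte-for-byte identical to the input. The Cyrillic part is
still valid UTF-8, and the Latin-1 part is not, which is roughly what a
UTF-8 terminal would choke on.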

> This means
> that some of the elements in the string aren't characters, only byte
> values that are parts of multi-byte characters. And of course 'Length
> returns the number of bytes, not the number of characters. This looks
> like a violation of the standard. Should I consider this a bug in the
> library? Or in the compiler (Gnat (GCC) 3.3.2 and 3.4.0)?

I've always considered it a "feature". It may be suboptimal, but it's not
easy to fix and fixing it brings a large number of problems with it.
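
To make the byte-versus-character distinction from the quoted question
concrete, here is a small sketch (Python, not Ada; the variable names are
only illustrative). An Ada String holding UTF-8 behaves like `as_utf8`
below: its length counts bytes, not characters.

```python
# "Bjorn" with o-diaeresis (U+00F6), written with an escape so the source
# file's own encoding does not matter.
name = "Bj\u00f6rn"

as_utf8 = name.encode("utf-8")    # what a UTF-8 system hands to the program
as_latin1 = name.encode("latin-1")

char_count = len(name)            # number of characters
utf8_bytes = len(as_utf8)         # number of bytes in UTF-8
latin1_bytes = len(as_latin1)     # number of bytes in Latin-1
```

In Latin-1 the byte count and the character count agree; in UTF-8 the
o-diaeresis takes two bytes, so the byte count is one larger.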

Consider this: what happens when the command line contains a filename?
That filename may not be in UTF-8; in fact, under Unix, filenames are
merely byte strings that don't contain 16#00# or 16#2F# (NUL or '/'). (On
my UTF-8 system, I've had several filenames copied from other systems in
Latin-1.) If you do recode it, rather than ignoring those files, you must
turn it back to its original encoding before passing the name to any system
function. Any string that may have to be exactly matched may have this
problem; far from all of my text files are UTF-8, but I may want to search
for a byte string in one of them.
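
A hedged sketch of the round-trip problem, again in Python for ease of
running. The filename is made up, and `surrogateescape` is Python's own
error handler for smuggling undecodable bytes through a string; nothing
here claims that any Ada runtime does the same. It only shows that a naive
recode loses the original bytes, while a reversible one has to be undone
before the name goes back to the system.

```python
# A hypothetical Latin-1 filename as it might sit on a UTF-8 system.
raw_name = "r\u00e9sum\u00e9.txt".encode("latin-1")

# Unix only forbids these two byte values inside a filename component.
allowed = b"\x00" not in raw_name and b"/" not in raw_name

# Strict UTF-8 decoding rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy recoding: bad bytes become U+FFFD, so the original name is gone.
lossy = raw_name.decode("utf-8", errors="replace").encode("utf-8")

# Reversible recoding: bad bytes are escaped, then restored on the way out.
text = raw_name.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")
```

Only `restored` matches `raw_name`, so only that variant could be handed
back to a system call and still find the file.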

Whether or not this is right depends on where Ada should be on the C-Java
spectrum. (Yes, I realize some will complain, but in geometry I may draw a
line through any two points and define that as an axis.) C doesn't worry
about it at all; it's harder to deal with, but it simplifies viewing the
world as a stream of bytes and interfacing with Unix filenames and files.
Java gets it "right" in converting everything to Unicode, but this is
often inefficient, and can make it harder to deal with the real world. I
doubt that Ada is going to get it "right" without too many incompatible
changes, so I'd rather it stayed "C-ish" here.
