From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,ece5a18e6179c51a
X-Google-Attributes: gid103376,public
X-Google-ArrivalTime: 2003-10-23 08:49:35 PST
Path: 
 archiver1.google.com!news2.google.com!news.maxwell.syr.edu!wn14feed!worldnet.att.net!204.127.198.203!attbi_feed3!attbi_feed4!attbi.com!attbi_s51.POSTED!not-for-mail
Message-ID: <3F97F83A.6060103@comcast.net>
From: "Robert I. Eachus" <rieachus@comcast.net>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
 rv:1.0.2) Gecko/20021120 Netscape/7.01
X-Accept-Language: en-us, en
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: Ada, Gnat and Unicode
References: <5d6fdb61.0310230648.62219442@posting.google.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
NNTP-Posting-Host: 24.34.139.183
X-Complaints-To: abuse@comcast.net
X-Trace: attbi_s51 1066924174 24.34.139.183 (Thu, 23 Oct 2003 15:49:34 GMT)
NNTP-Posting-Date: Thu, 23 Oct 2003 15:49:34 GMT
Organization: Comcast Online
Date: Thu, 23 Oct 2003 15:49:34 GMT
Xref: archiver1.google.com comp.lang.ada:1523
Date: 2003-10-23T15:49:34+00:00
List-Id: <comp.lang.ada>

Jano wrote:

> I'm thinking about the best procedure to internationalize some Ada
> program and I have some doubts. Please shed some light if you can.

Okay.

> AFAIK, the Ada Character type is the 256 first values from ISO 10646
> (Latin1). In the same fashion, Wide_Character are the 2**16 values of
> that same ISO. The ARM furthermore says that an implementation can
> provide alternate representations conforming to local conventions, but
> later it states that said representation should be a proper subset of
> these two. I'm not very sure about what that implies.

First, that is correct.  By default Standard.Character is Latin1.  Some
compilers, such as GNAT, allow using other mappings.

Second, the Implementation Advice means just what it says.  It is a
"nice to have" feature: if you choose, say, Latin2, there should be a
defined mapping from Character to Wide_Character.  If you choose some
other character set that is not in the BMP, such a mapping may not be
possible.  (For example Klingon, or Japanese Shift-JIS. ;-)  All this
says is: vendors, please, if the mapping makes sense, provide it.  And in
fact the GNAT RM documents, under Implementation Advice, that the JIS and
IEC Japanese encodings do not follow it, because for these two encodings
it doesn't make sense to do so.
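
To make the idea of a "defined mapping" concrete, here is a small sketch in Python (not Ada, and not anything GNAT itself does).  It assumes Python's built-in "latin-1" and "iso8859_2" codecs as stand-ins for the Character-to-Wide_Character tables; the helper names widen_latin1 and widen_latin2 are made up for illustration.

```python
# Widening an 8-bit character to its 16-bit (BMP) code point.
# Hypothetical helpers; Python codecs stand in for the mapping tables.

def widen_latin1(octet: int) -> int:
    # Latin-1 octets coincide with the first 256 ISO 10646 code points.
    return ord(bytes([octet]).decode("latin-1"))

def widen_latin2(octet: int) -> int:
    # Latin-2 needs a real table: some octets land elsewhere in the BMP.
    return ord(bytes([octet]).decode("iso8859_2"))

for octet in (0x41, 0xE9, 0xA3):
    print(f"{octet:#04x}  Latin-1 -> U+{widen_latin1(octet):04X}  "
          f"Latin-2 -> U+{widen_latin2(octet):04X}")
```

For Latin1 the mapping is the identity on the numeric value; for Latin2 an octet such as 16#A3# has to be looked up, which is the kind of table the advice asks vendors to supply.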

> Some old discussion suggest that 10646 and Unicode are equivalent, but
> it seems that later they dissociated. In any case Unicode is more than
> the 2**16 values that Wide_character can hold so I'm not sure that
> Wide_character is useful at all (?)

The best way to describe the relationship between ISO 10646-1 and
Unicode is that the BMP (and some other planes of ISO 10646-1) map
exactly onto Unicode, and vice versa.  Each standard also includes some
things that are not part of the other, but these areas where the
standards differ can for the most part be ignored.  For example, the
ISO 10646 definition of UTF-8 allows for representing any (4 octet,
32-bit) character in UTF-8, while the Unicode standard only covers the
encoding of Unicode characters.

The practical effect of this is that characters outside the BMP but in 
Unicode have at least two potential representations.  But if you get 
that far, you have already had to deal with the alternate 
representations of characters in the BMP through composition.  (For 
example adding a cedilla to a "c".)  Also, Unicode is stricter in 
determining which encodings should and should not be used.
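
As an illustration of the cedilla case, the following Python sketch lists the two code point sequences involved.  It relies only on the standard unicodedata module; the helper name code_points is an invention for this example.

```python
import unicodedata

precomposed = "\u00E7"   # LATIN SMALL LETTER C WITH CEDILLA
combining = "c\u0327"    # "c" followed by COMBINING CEDILLA

def code_points(text):
    # Hypothetical helper: show each character as U+XXXX.
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points(precomposed), unicodedata.name(precomposed))
print(code_points(combining))
# A plain code-point comparison treats the two spellings as different.
print(precomposed == combining)
# The character database records how the precomposed form decomposes.
print(unicodedata.decomposition(precomposed))
```

Both sequences are in the BMP and display the same way, yet they are different strings as far as a naive comparison is concerned.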

If you use UTF-8 for source input in GNAT, be aware that it supports
UTF-8 only for BMP characters; full UTF-8, including 6 octet encodings,
is not supported.  (Note that all Unicode characters are effectively
supported in GNAT, although you will have to write each one as two
16-bit values, a surrogate pair, each encoded as a three octet
sequence, giving a six octet encoding...)
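
The arithmetic behind that six octet form can be sketched in Python as follows.  This only shows how two 16-bit values, each written as three octets, add up to six; whether it matches GNAT's input handling byte for byte is an assumption, and the function names and the sample character are chosen purely for illustration.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def six_octet_form(code_point: int) -> bytes:
    """Encode each 16-bit value as its own three octet sequence."""
    high, low = surrogate_pair(code_point)
    # "surrogatepass" lets Python emit the octets for a lone surrogate.
    return b"".join(chr(s).encode("utf-8", "surrogatepass")
                    for s in (high, low))

clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, an arbitrary non-BMP example
print(six_octet_form(clef).hex())       # six octets
print(chr(clef).encode("utf-8").hex())  # ordinary four octet UTF-8
```

The same character therefore has a four octet and a six octet spelling, which is one source of the "at least two potential representations" mentioned above.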

> Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
> can hold anything in the Unicode world, is space efficient, provides
> good interoperability with other languages/Packages (GtkAda, Java,
> ...).
> 
> My doubt principally comes from behavior when you're not using a
> Latin1 OS, for example a Chinese Windows. When you do some I/O, for
> example a read from console with Text_IO.Get (Wide_Text_IO?). Or when
> using Gnat.Directory_Operations to enumerate files.
> 
> I don't find information in the Gnat UG/RM about these things.

Look again in the GNAT User's Guide, under "Foreign Language Representation."

> What will these functions return? It's specified somewhere, or will they
> pass the bytes from the underlying OS calls inside a String so I can't
> know in advance what to expect?

The real problems are in interpreting Strings and Wide_Strings and
deciding when two Strings or Wide_Strings should compare equal.  As long
as the canonicalization of the representations is outside your
application, great.  (For example, the OS probably provides a call for
converting a Unicode string to a canonical representation.)  Unless you
really want to get deeply into writing Unicode (or ISO 10646-1) support,
use whatever internationalization facilities the OS provides.  Doing a
better (or worse) job than the OS will get you no thanks, and neither
will implementing exactly the same rules if the OS is later updated.
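
A minimal sketch of what "convert to a canonical representation, then compare" looks like, written in Python.  Here the standard unicodedata.normalize function plays the part of the OS call, and the choice of the NFC form and the name canonical_equal are assumptions made for the example.

```python
import unicodedata

def canonical_equal(left: str, right: str) -> bool:
    """Compare two strings after converting both to the NFC form."""
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))

one = "fa\u00E7ade"    # precomposed c-cedilla
two = "fac\u0327ade"   # "c" plus combining cedilla

print(one == two)                 # raw comparison
print(canonical_equal(one, two))  # comparison after canonicalization
```

The point is that the application only calls the conversion; the rules for what counts as equivalent stay in the library or OS that owns them.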

-- 
                                                     Robert I. Eachus

"Quality is the Buddha. Quality is scientific reality. Quality is the 
goal of Art. It remains to work these concepts into a practical, 
down-to-earth context, and for this there is nothing more practical or 
down-to-earth than what I have been talking about all along...the repair 
of an old motorcycle."  -- from Zen and the Art of Motorcycle 
Maintenance by Robert Pirsig