From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Ada, Gnat and Unicode
Date: Thu, 23 Oct 2003 15:49:34 GMT
Message-ID: <3F97F83A.6060103@comcast.net>
References: <5d6fdb61.0310230648.62219442@posting.google.com>

Jano wrote:

> I'm thinking about the best procedure to internationalize some Ada
> program and I have some doubts. Please shed some light if you can.

Okay.

> AFAIK, the Ada Character type is the 256 first values from ISO 10646
> (Latin1). In the same fashion, Wide_Character are the 2**16 values of
> that same ISO. The ARM furthermore says that an implementation can
> provide alternate representations conforming to local conventions, but
> later it states that said representation should be a proper subset of
> these two. I'm not very sure about what that implies.

First, that is correct. By default Standard.Character is Latin1.
Some compilers, such as GNAT, allow using other mappings.

Second, the Implementation Advice means just what it says. It is a
"nice to have" feature: if you choose, say, Latin2, there should be a
defined mapping from Character to Wide_Character. If you choose some
other character set that is not in the BMP, it may not be possible.
(For example Klingon, or Japanese Shift-JIS. ;-) All this says is:
vendors, please, if the mapping makes sense, provide it. And in fact
the GNAT RM does document, under Implementation Advice, that the JIS
and EUC Japanese encodings do not follow it, because for these two
encodings it doesn't make sense to do so.

> Some old discussion suggest that 10646 and Unicode are equivalent, but
> it seems that later they dissociated. In any case Unicode is more than
> the 2**16 values that Wide_character can hold so I'm not sure that
> Wide_character is useful at all (?)

The best way to describe the relationship between ISO 10646-1 and
Unicode is that the BMP (and some other planes of ISO 10646-1) map
exactly to Unicode and vice versa. Unicode adds some things as part of
the standard that are not part of ISO 10646-1 and vice versa, but the
areas where the standards differ can for the most part be ignored.

For example, the ISO 10646 definition of UTF-8 allows for representing
any (4 octet, 32-bit) character in UTF-8, while the Unicode standard
only covers the encoding for Unicode. The practical effect of this is
that characters outside the BMP but in Unicode have at least two
potential representations. But if you get that far, you have already
had to deal with the alternate representations of characters in the
BMP through composition. (For example, adding a cedilla to a "c".)
Also, Unicode is stricter in determining which encodings should and
should not be used.

If you use UTF-8 for source input in GNAT, be aware that it supports
UTF-8 only for BMP characters; full UTF-8, including 6-octet
encodings, is not supported.
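
The two points made so far, that composition gives a BMP character
more than one representation and that a character outside the BMP can
be written either as one four-octet UTF-8 sequence or as two 16-bit
values of three octets each, can be illustrated with a short sketch.
It is in Python rather than Ada purely for brevity. The helper names
canonically_equal and surrogate_pair_octets are hypothetical, the
choice of NFC as the canonical form is an assumption, and whether the
six-octet form GNAT accepts matches this output byte for byte has not
been checked here.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def surrogate_pair_octets(ch: str) -> bytes:
    """Encode one non-BMP character as two 3-octet sequences (6 octets).

    Hypothetical helper: splits the code point into its UTF-16 surrogate
    pair and writes each 16-bit value as a 3-octet UTF-8-style sequence.
    """
    code_point = ord(ch)
    if code_point < 0x10000:
        raise ValueError("character is in the BMP; no surrogate pair needed")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    # "surrogatepass" lets Python emit the octets for a lone surrogate.
    return b"".join(chr(s).encode("utf-8", "surrogatepass") for s in (high, low))


if __name__ == "__main__":
    precomposed = "\u00e7"    # c with cedilla as a single code point
    decomposed = "c\u0327"    # "c" followed by COMBINING CEDILLA

    # Different code point sequences, same character after normalization.
    print(precomposed == decomposed)
    print(canonically_equal(precomposed, decomposed))

    clef = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, outside the BMP
    print(len("\u4e2d".encode("utf-8")))      # octet count for a BMP character
    print(clef.encode("utf-8").hex())         # single 4-octet sequence
    print(surrogate_pair_octets(clef).hex())  # two 3-octet sequences
```

The comparison helper is the kind of canonicalization that is better
left to the OS or a library than reimplemented in the application.
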
(Note that all Unicode characters are effectively supported in GNAT,
although for a character outside the BMP you will have to use two
16-bit values (a surrogate pair), each encoded as a three-octet
sequence, giving a six-octet encoding...)

> Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
> can hold anything in the Unicode world, is space efficient, provides
> good interoperability with other languages/Packages (GtkAda, Java,
> ...).
>
> My doubt principally comes from behavior when you're not using a
> Latin1 OS, for example a Chinese Windows. When you do some I/O, for
> example a read from console with Text_IO.Get (Wide_Text_IO?). Or when
> using Gnat.Directory_Operations to enumerate files.
>
> I don't find information in the Gnat UG/RM about these things.

Look again in the GNAT User's Guide, under "Foreign Language
Representation."

> What will these functions return? It's specified somewhere, or will they
> pass the bytes from the underlying OS calls inside a String so I can't
> know in advance what to expect?

The real problems are in interpreting Strings and Wide_Strings and
deciding when two Strings or Wide_Strings should compare equal. As
long as the canonicalization of the representations is outside your
application, great. (For example, the OS probably provides a call for
converting a Unicode string to a canonical representation.) Unless you
really want to get deeply into writing Unicode (or ISO 10646-1)
support, use whatever internationalization facilities the OS provides.
Doing a better (or worse) job than the OS will get you no thanks, and
neither will implementing exactly the same rules only to have the OS
updated afterwards.

--
Robert I. Eachus

"Quality is the Buddha. Quality is scientific reality. Quality is the
goal of Art. It remains to work these concepts into a practical,
down-to-earth context, and for this there is nothing more practical or
down-to-earth than what I have been talking about all along...the
repair of an old motorcycle." -- from Zen and the Art of Motorcycle
Maintenance by Robert Pirsig