* Ada, Gnat and Unicode
From: Jano @ 2003-10-23 14:48 UTC

Hello sirs,

I'm thinking about the best procedure to internationalize an Ada
program and I have some doubts. Please shed some light if you can.

AFAIK, the Ada Character type is the first 256 values of ISO 10646
(Latin1). In the same fashion, Wide_Character is the first 2**16 values
of that same standard. The ARM furthermore says that an implementation
can provide alternate representations conforming to local conventions,
but later it states that said representation should be a proper subset
of these two. I'm not very sure what that implies.

Some old discussions suggest that 10646 and Unicode are equivalent, but
it seems that they later diverged. In any case Unicode has more than
the 2**16 values that Wide_Character can hold, so I'm not sure that
Wide_Character is useful at all (?)

Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
can hold anything in the Unicode world, is space efficient, and provides
good interoperability with other languages/packages (GtkAda, Java,
...).

My doubt principally comes from behavior when you're not using a
Latin1 OS, for example a Chinese Windows: what happens when you do some
I/O, such as a read from the console with Text_IO.Get (Wide_Text_IO?),
or when you use Gnat.Directory_Operations to enumerate files?

I don't find information in the Gnat UG/RM about these things. What
will these functions return? Is it specified somewhere, or will they
pass the bytes from the underlying OS calls inside a String so I can't
know in advance what to expect?

Thanks for any clarifications,

Alex.

* Re: Ada, Gnat and Unicode
From: Robert I. Eachus @ 2003-10-23 15:49 UTC

Jano wrote:

> I'm thinking about the best procedure to internationalize some Ada
> program and I have some doubts. Please shed some light if you can.

Okay.

> AFAIK, the Ada Character type is the 256 first values from ISO 10646
> (Latin1). In the same fashion, Wide_Character are the 2**16 values of
> that same ISO. The ARM furthermore says that an implementation can
> provide alternate representations conforming to local conventions, but
> later it states that said representation should be a proper subset of
> these two. I'm not very sure about what that implies.

First, that is correct. By default Standard.Character is Latin1. Some
compilers, such as GNAT, allow using other mappings.

Second, the Implementation Advice means just what it says. It is a
"nice to have" feature: if you choose, say, Latin2, there should be a
defined mapping from Character to Wide_Character. If you choose some
other character set that is not in the BMP, that may not be possible.
(For example Klingon, or Japanese Shift-JIS. ;-) All this says is:
vendors, please, if the mapping makes sense, provide it. And in fact
the GNAT RM does document, under Implementation Advice, that the JIS
and EUC Japanese encodings do not follow it, because for these two
encodings it doesn't make sense to do so.

> Some old discussion suggest that 10646 and Unicode are equivalent, but
> it seems that later they dissociated. In any case Unicode is more than
> the 2**16 values that Wide_character can hold so I'm not sure that
> Wide_character is useful at all (?)

The best way to describe the relationship between ISO 10646-1 and
Unicode is that the BMP (and some other planes of ISO 10646-1) are
exactly mapped to Unicode and vice-versa.
Unicode adds some things as part of the standard that are not part of
ISO 10646-1 and vice-versa, but these areas where the standards differ
can for the most part be ignored. For example, the ISO 10646 definition
of UTF-8 allows for representing any (4 octet, 32-bit) character in
UTF-8, while the Unicode standard only covers the encoding for Unicode.

The practical effect of this is that characters outside the BMP but in
Unicode have at least two potential representations. But if you get
that far, you have already had to deal with the alternate
representations of characters in the BMP through composition. (For
example adding a cedilla to a "c".) Also, Unicode is stricter in
determining which encodings should and should not be used.

If you use UTF-8 for source input in GNAT, be aware that they only
support UTF-8 for BMP characters; full UTF-8, including 6-octet
encodings, is not supported. (Note that all Unicode characters are
effectively supported in GNAT, although you will have to encode the two
16-bit values as three-octet sequences, giving a six-octet encoding...)

> Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
> can hold anything in the Unicode world, is space efficient, provides
> good interoperability with other languages/Packages (GtkAda, Java,
> ...).
>
> My doubt principally comes from behavior when you're not using a
> Latin1 OS, for example a Chinese Windows. When you do some I/O, for
> example a read from console with Text_IO.Get (Wide_Text_IO?). Or when
> using Gnat.Directory_Operations to enumerate files.
>
> I don't find information in the Gnat UG/RM about these things.

Look again, in the GNAT Users Guide, for "Foreign Language
Representation."

> What will these functions return? It's specified somewhere, or will they
> pass the bytes from the underlying OS calls inside a String so I can't
> know in advance what to expect?

The real problems are in interpreting Strings and Wide_Strings and
deciding when two Strings or Wide_Strings should compare equal. As long
as the canonicalization of the representations is outside your
application, great. (For example, the OS probably provides a call for
converting a Unicode string to a canonical representation.)

Unless you really want to get deeply into writing Unicode (or ISO
10646-1) support, use whatever internationalization facilities the OS
provides. Doing a better (or worse) job than the OS will get you no
thanks, and neither will implementing exactly the same rules if the OS
is later updated.

-- 
Robert I. Eachus

"Quality is the Buddha. Quality is scientific reality. Quality is the
goal of Art. It remains to work these concepts into a practical,
down-to-earth context, and for this there is nothing more practical or
down-to-earth than what I have been talking about all along...the
repair of an old motorcycle." -- from Zen and the Art of Motorcycle
Maintenance by Robert Pirsig
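
The remark above about characters outside the BMP having at least two potential representations can be made concrete with a small sketch. It is written in Python purely for illustration (the thread itself contains no code), and the function names are hypothetical. It compares the ordinary four-octet UTF-8 form of one non-BMP code point with the six-octet form obtained by encoding its two 16-bit surrogate values as three-octet sequences. Whether GNAT produces or accepts exactly these bytes is an assumption here, not something the thread confirms.

```python
from typing import Tuple


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above the BMP into its two 16-bit surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def three_octets(unit: int) -> bytes:
    """Apply the 3-octet UTF-8 bit pattern to one 16-bit value (0x0800..0xFFFF)."""
    if not 0x0800 <= unit <= 0xFFFF:
        raise ValueError("value does not need exactly three octets")
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def six_octet_form(code_point: int) -> bytes:
    """Encode each surrogate separately: 2 x 3 octets (not valid strict UTF-8)."""
    high, low = surrogate_pair(code_point)
    return three_octets(high) + three_octets(low)


def four_octet_form(code_point: int) -> bytes:
    """Standard UTF-8 encoding of the same code point."""
    return chr(code_point).encode("utf-8")


if __name__ == "__main__":
    clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, a code point outside the BMP
    print(four_octet_form(clef).hex(" "))  # f0 9d 84 9e
    print(six_octet_form(clef).hex(" "))   # ed a0 b4 ed b4 9e
```

Both byte sequences denote the same character, which is why a byte-wise comparison of two such strings can fail even when the text is identical.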

* Re: Ada, Gnat and Unicode
From: Jano @ 2003-10-23 17:38 UTC

Robert I. Eachus wrote...

(Snipped some interesting bits).

> If you use UTF-8 for source input in GNAT, be aware that they only
> support UTF-8 for BMP characters, full UTF-8 including 6 octet encodings
> is not supported. (Note that all Unicode characters are effectively
> supported in GNAT, although you will have to use two 16-bit encodings as
> three octet sequences giving a six octet encoding...)

Thanks for your reply, and now for some clarifications and more
doubts ;)

Firstly, I wasn't referring to using anything outside of Latin1 in my
source code. I think it will be best if I explain my problem better.

I'm trying out an open source p2p protocol. It permits file searches by
keyword. These keywords are filenames and/or metadata about the files.
This data is exchanged UTF8 encoded.

As you may be seeing now, I want to scan a folder and transform the
filenames into UTF8. That's fine for me, since I know that I'm getting
Latin1 encoded strings from the Directory_Operations package, and from
any metadata entered by the user. But I was wondering what would happen
to a Chinese user (not that I foresee any usage of my program in wide
deployment, but when faced with the problem one *must* know ;)

> > I don't find information in the Gnat UG/RM about these things.
>
> Look again, in the GNAT Users Guide for "Foreign Language Representation."

Correct me if I'm wrong: that refers to source representation? (I had
missed it anyway ^_^)

(Of course if my program were to be translated, that applies. I'm not
so concerned about this but I should have been clearer).

As a final side note, my program is GUI-less, which is why I'm not
concerned about translation. However, it has a SOAP interface. To that
I've plugged a Java GUI which correctly decodes and shows my UTF8
strings (a few traces and status reports).

Thanks,

-- 
-------------------------
Jano
402450.at.cepsz.unizar.es
-------------------------
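
The filename question above comes down to this: the same bytes become different UTF-8 output depending on which encoding the OS is assumed to have used. The following is a minimal sketch in Python, used only for illustration. The function name `name_to_utf8` is hypothetical, and the choice of GBK as the encoding of a Chinese Windows system is an assumption; the thread does not say what Gnat.Directory_Operations actually returns there.

```python
def name_to_utf8(raw_name: bytes, os_encoding: str) -> bytes:
    """Re-encode a filename from the encoding the OS is assumed to use into UTF-8."""
    return raw_name.decode(os_encoding).encode("utf-8")


if __name__ == "__main__":
    # A Latin-1 filename: the single byte 0xE9 is e-acute.
    latin1_name = b"caf\xe9.txt"
    print(name_to_utf8(latin1_name, "latin-1"))  # b'caf\xc3\xa9.txt'

    # Four bytes that spell two Chinese characters in GBK (assumed OS encoding).
    gbk_name = b"\xd6\xd0\xce\xc4.txt"
    right = name_to_utf8(gbk_name, "gbk")
    wrong = name_to_utf8(gbk_name, "latin-1")  # treats each byte as a Latin-1 character
    print(right)           # b'\xe4\xb8\xad\xe6\x96\x87.txt'
    print(right == wrong)  # False
```

Treating every filename as Latin-1 never raises an error, because every byte value is a valid Latin-1 character, so a wrong assumption silently produces wrong keywords rather than a visible failure.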

* Re: Ada, Gnat and Unicode
From: Robert I. Eachus @ 2003-10-23 21:54 UTC

Jano wrote:

> Robert I. Eachus wrote...

> As you may be seeing now, I want to scan a folder and transform the
> filenames into UTF8. That's fine for me which know that I'm getting
> Latin1 encoded strings from the Directory_Operations package, and any
> metadata entered by the user. But I was wondering what would happen to a
> Chinese user (not that I foresee any usage of my program in wide
> deployment, but when faced with the problem one *must* know ;)

Remember my advice about canonicalization. If you get Unicode or UTF-8
file names from the OS, they may or may not be in a canonical form. If
not, get the OS to do it for you. And of course, this information is OS
specific. You won't really care what the OS's definition of canonical
form is, just whether the strings you are getting are in that form, and
if not, how to call the OS to convert them.

>>Look again, in the GNAT Users Guide for "Foreign Language Representation."
>
> Correct me, that refers to source representation? (I had missed it
> anyway ^_^)

Yes, it refers to source representation, but if you think about it for
a second, the source representation of non-Latin1 characters is an
issue for Character and String literals. Otherwise the compiler doesn't
care what Character type you use in your program.

> (Of course if my program were to be translated, that applies. I'm not so
> concerned about this but I should have been clearer).

-- 
Robert I. Eachus
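
The canonicalization advice above can be illustrated with the "c" plus cedilla case mentioned earlier in the thread. This is a minimal sketch in Python, for illustration only, and the helper names are hypothetical. It uses the standard library's `unicodedata.normalize` as a stand-in for whatever call the OS provides, and it assumes NFC as the canonical form; the form a given OS or file system actually uses may differ.

```python
import unicodedata


def canonical(text: str, form: str = "NFC") -> str:
    """Return text in one normalization form (NFC is an assumption, not an OS rule)."""
    return unicodedata.normalize(form, text)


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same canonical form."""
    return canonical(a) == canonical(b)


if __name__ == "__main__":
    precomposed = "gar\u00e7on"   # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
    decomposed = "garc\u0327on"   # "c" followed by U+0327 COMBINING CEDILLA

    print(precomposed == decomposed)           # False: different code point sequences
    print(same_name(precomposed, decomposed))  # True: equal once both are normalized
    print(canonical(decomposed).encode("utf-8") == precomposed.encode("utf-8"))  # True
```

For the keyword search described in the thread, the point is to normalize both the stored names and the incoming query with the same rule before comparing them.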

* Re: Ada, Gnat and Unicode
From: Jano @ 2003-10-24 15:09 UTC

Robert I. Eachus wrote...
> Jano wrote:
> > Robert I. Eachus wrote...
>
> > As you may be seeing now, I want to scan a folder and transform the
> > filenames into UTF8. That's fine for me which know that I'm getting
> > Latin1 encoded strings from the Directory_Operations package, and any
> > metadata entered by the user. But I was wondering what would happen to a
> > Chinese user (not that I foresee any usage of my program in wide
> > deployment, but when faced with the problem one *must* know ;)
>
> Remember my advice about canonicalization. If you get Unicode or UTF-8
> file names from the OS, they may or may not be in a canonical form. If
> not, get the OS to do it for you. And of course, this information is OS
> specific. You won't really care what the OS's definition of canonical
> form is, just whether the strings you are getting are in that form, and
> if not how to call the OS to do that.

OK, I see. In the end that's the answer I didn't want to hear, but the
one I expected.

> Yes, it refers to source representation, but if you think about it for a
> second, the source representation of non-Latin1 characters is an issue
> for Character and String literals. Otherwise the compiler doesn't care
> what Character type you use in your program.

I was referring to that too :)

Thanks!

-- 
-------------------------
Jano
402450.at.cepsz.unizar.es
-------------------------

* Re: Ada, Gnat and Unicode
From: Steve @ 2003-10-24 4:01 UTC

A good place to start looking is to download XML/Ada and have a look at
the unicode part. There appears to be extensive support there.

Steve
(The Duck)

* Re: Ada, Gnat and Unicode
From: Jano @ 2003-10-24 15:07 UTC

Steve wrote...
> A good place to start looking is to download XML/Ada and have a look at the
> unicode part. There appears to be extensive support there.

I'm already using it for both XML and Unicode purposes.