comp.lang.ada
* Ada, Gnat and Unicode
@ 2003-10-23 14:48 Jano
  2003-10-23 15:49 ` Robert I. Eachus
  2003-10-24  4:01 ` Steve
  0 siblings, 2 replies; 7+ messages in thread
From: Jano @ 2003-10-23 14:48 UTC (permalink / raw)


Hello sirs,

I'm thinking about the best way to internationalize an Ada program
and I have some questions. Please shed some light if you can.

AFAIK, the Ada Character type covers the first 256 values of ISO 10646
(Latin-1). In the same fashion, Wide_Character covers the first 2**16
values of that same standard. The ARM furthermore says that an
implementation can provide alternate representations conforming to
local conventions, but later it states that said representation should
be a proper subset of these two. I'm not sure what that implies.

Some old discussions suggest that 10646 and Unicode are equivalent, but
it seems that they later diverged. In any case Unicode has more than
the 2**16 values that Wide_Character can hold, so I'm not sure that
Wide_Character is useful at all (?)

Anyhow, I was thinking of using the UTF-8 encoding. That's convenient
as it can hold anything in the Unicode world, is space efficient, and
provides good interoperability with other languages/packages (GtkAda,
Java, ...).

My main question is about behavior on an OS that is not Latin-1, for
example a Chinese Windows. What happens when you do some I/O, for
example a read from the console with Text_IO.Get (or Wide_Text_IO?),
or when you use GNAT.Directory_Operations to enumerate files?

I can't find information in the GNAT UG/RM about these things. What
will these functions return? Is it specified somewhere, or will they
pass the bytes from the underlying OS calls inside a String, so that I
can't know in advance what to expect?

Thanks for any clarifications,

Alex.




* Re: Ada, Gnat and Unicode
  2003-10-23 14:48 Ada, Gnat and Unicode Jano
@ 2003-10-23 15:49 ` Robert I. Eachus
  2003-10-23 17:38   ` Jano
  2003-10-24  4:01 ` Steve
  1 sibling, 1 reply; 7+ messages in thread
From: Robert I. Eachus @ 2003-10-23 15:49 UTC (permalink / raw)


Jano wrote:

> I'm thinking about the best procedure to internationalize some Ada
> program and I have some doubts. Please shed some light if you can.

Okay.

> AFAIK, the Ada Character type is the 256 first values from ISO 10646
> (Latin1). In the same fashion, Wide_Character are the 2**16 values of
> that same ISO. The ARM furthermore says that an implementation can
> provide alternate representations conforming to local conventions, but
> later it states that said representation should be a proper subset of
> these two. I'm not very sure about what that implies.

First, that is correct.  By default Standard.Character is Latin1.  Some 
compilers, such as GNAT, allow using other mappings.

Second, what it means by the Implementation Advice is just that. It is a 
"nice to have" feature that if you choose say Latin2 there is a defined 
mapping from Character to Wide_Character.  If you choose some other 
character set that is not in the BMP, it may not be possible. (For 
example Klingon, or Japanese Shift-JIS. ;-) All this says is vendors, 
please, if the mapping makes sense, provide it.  And in fact the GNAT RM 
does document under Implementation Advice, that JIS and IEC Japanese 
encodings do not follow it, because for these two encodings, it doesn't 
make sense to do so.
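
To make the idea of a "defined mapping" concrete, here is a minimal 
sketch. It is in Python rather than Ada, only because Python's bundled 
codec tables make the mapping easy to inspect. The helper name to_wide 
is made up for this illustration and is not part of any Ada or GNAT 
API; the sketch says nothing about how GNAT implements its mappings.

```python
def to_wide(octet: int, charset: str) -> int:
    """Map one 8-bit character code to its ISO 10646 code point."""
    return ord(bytes([octet]).decode(charset))

# Latin-1 is the identity mapping onto the first 256 code points.
latin1_is_identity = all(to_wide(i, "latin-1") == i for i in range(256))

# Latin-2 reuses the same 8-bit codes for different characters, but
# every one of them still lands inside the BMP (code point < 2**16).
latin2_codes = [to_wide(i, "iso8859_2") for i in range(256)]
latin2_in_bmp = all(cp < 2**16 for cp in latin2_codes)

print(latin1_is_identity, latin2_in_bmp)
# The same octet means different characters in the two sets.
print(hex(to_wide(0xA3, "latin-1")), hex(to_wide(0xA3, "iso8859_2")))
```

For a character set like this, every 8-bit code has exactly one 16-bit 
counterpart, which is the situation the Implementation Advice has in 
mind.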

> Some old discussion suggest that 10646 and Unicode are equivalent, but
> it seems that later they dissociated. In any case Unicode is more than
> the 2**16 values that Wide_character can hold so I'm not sure that
> Wide_character is useful at all (?)

The best way to describe the relationship between ISO 10646-1 and 
Unicode is that the BMP (and some other planes of ISO 10646-1) are 
exactly mapped to Unicode and vice-versa.  Unicode adds some things as 
part of the standard that are not part of ISO 10646-1 and vice-versa, 
but these areas where the standards differ can be for the most part 
ignored.  For example, the ISO 10646 definition of UTF-8 allows for 
representing any (4 octet, 32-bit) character in UTF-8, while the Unicode 
standard only covers the encoding for Unicode.

The practical effect of this is that characters outside the BMP but in 
Unicode have at least two potential representations.  But if you get 
that far, you have already had to deal with the alternate 
representations of characters in the BMP through composition.  (For 
example adding a cedilla to a "c".)  Also, Unicode is stricter in 
determining which encodings should and should not be used.

If you use UTF-8 for source input in GNAT, be aware that it only 
supports UTF-8 for BMP characters; full UTF-8, including 6 octet 
encodings, is not supported.  (Note that all Unicode characters are 
effectively supported in GNAT, although you will have to use two 16-bit 
encodings, each written as a three octet sequence, giving a six octet 
encoding...)
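
As a rough illustration of that parenthetical, the sketch below (in 
Python, for easy experimentation) encodes a character outside the BMP 
in two ways: directly as UTF-8, and as a surrogate pair whose two 
16-bit halves are each written as a three octet sequence. This is one 
reading of the remark above, not a description of what GNAT accepts; 
the function names are hypothetical.

```python
def utf8_direct(code_point: int) -> bytes:
    """Encode a code point as ordinary UTF-8."""
    return chr(code_point).encode("utf-8")

def utf8_via_surrogates(code_point: int) -> bytes:
    """Encode a code point as UTF-8 applied to its 16-bit units."""
    if code_point < 0x10000:
        # BMP characters need only one 16-bit unit.
        return utf8_direct(code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # leading surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing surrogate
    # "surrogatepass" lets Python encode lone surrogates, which
    # strict UTF-8 forbids.
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
    )

code_point = 0x10400  # a character outside the BMP
print(len(utf8_direct(code_point)), len(utf8_via_surrogates(code_point)))
```

The two byte strings denote the same character but are not equal, 
which is the "at least two potential representations" problem 
mentioned earlier.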

> Anyhow, I was thinking of using UTF8 encoding. That's convenient as it
> can hold anything in the Unicode world, is space efficient, provides
> good interoperability with other languages/Packages (GtkAda, Java,
> ...).
> 
> My doubt principally comes from behavior when you're not using a
> Latin1 OS, for example a Chinese Windows. When you do some I/O, for
> example a read from console with Text_IO.Get (Wide_Text_IO?). Or when
> using Gnat.Directory_Operations to enumerate files.
> 
> I don't find information in the Gnat UG/RM about these things.

Look again, in the GNAT Users Guide for "Foreign Language Representation."

> What will these functions return? It's specified somewhere, or will they
> pass the bytes from the underlying OS calls inside a String so I can't
> know in advance what to expect?

The real problems are in interpreting Strings and Wide_Strings and 
deciding when two Strings or Wide_Strings should compare true.  As long 
as the canonicalization of the representations is outside your 
application, great.  (For example, the OS probably provides a call for 
converting a Unicode string to a canonical representation.)  Unless you 
really want to get deeply into writing Unicode (or ISO 10646-1) support, 
use whatever internationalization facilities the OS provides.  Doing a 
better (or worse) job than the OS will get you no thanks, and neither 
will implementing exactly the same rules if the OS is later updated.
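
A small sketch of why comparison needs canonicalization, using the 
cedilla example from above. Python's unicodedata module stands in here 
for whatever normalization call the OS provides; that substitution, 
the choice of the NFC form, and the helper name same_text are 
assumptions made for the illustration.

```python
import unicodedata

precomposed = "\u00e7"   # c with cedilla as a single code point
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to one canonical form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Code point by code point the strings differ...
print(precomposed == decomposed)
# ...but after canonicalization they compare equal.
print(same_text(precomposed, decomposed))
```

Whichever canonical form is used matters less than using the same one 
on both sides of every comparison.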

-- 
                                                     Robert I. Eachus

"Quality is the Buddha. Quality is scientific reality. Quality is the 
goal of Art. It remains to work these concepts into a practical, 
down-to-earth context, and for this there is nothing more practical or 
down-to-earth than what I have been talking about all along...the repair 
of an old motorcycle."  -- from Zen and the Art of Motorcycle 
Maintenance by Robert Pirsig





* Re: Ada, Gnat and Unicode
  2003-10-23 15:49 ` Robert I. Eachus
@ 2003-10-23 17:38   ` Jano
  2003-10-23 21:54     ` Robert I. Eachus
  0 siblings, 1 reply; 7+ messages in thread
From: Jano @ 2003-10-23 17:38 UTC (permalink / raw)


Robert I. Eachus wrote...

(Snipped some interesting bits).

> If you use UTF-8 for source input in GNAT, be aware that they only 
> support UTF-8 for BMP characters, full UTF-8 including 6 octet encodings 
> is not supported.  (Note that all Unicode characters are effectively 
> supported in GNAT, although you will have to use two 16-bit encodings as 
> three octet sequences giving a six octet encoding...)

Thanks for your reply, and now for some clarifications and more 
questions ;)

Firstly, I wasn't talking about using anything outside Latin-1 in my 
source code. I think it will be best if I explain my problem better.

I'm experimenting with an open source p2p protocol. It permits file 
searches by keyword. The keywords come from filenames and/or metadata 
about the files. This data is exchanged UTF-8 encoded.

As you can probably see now, I want to scan a folder and transform the 
filenames into UTF-8. That's fine for me, since I know that I'm getting 
Latin-1 encoded strings from the Directory_Operations package, and the 
same goes for any metadata entered by the user. But I was wondering what 
would happen for a Chinese user (not that I foresee wide deployment of 
my program, but when faced with the problem one *must* know ;)
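
The concern can be sketched as follows (in Python, for brevity). The 
conversion is only correct if the encoding of the bytes coming from 
the OS is known. The function name filename_to_utf8 is hypothetical, 
and "gbk" is merely an assumed example of what a Chinese Windows might 
hand back; the real encoding depends on the system.

```python
def filename_to_utf8(raw: bytes, os_encoding: str) -> bytes:
    """Re-encode a file name from the OS encoding into UTF-8."""
    return raw.decode(os_encoding).encode("utf-8")

# Bytes as they might arrive from a directory listing.
latin1_name = "canci\u00f3n.mp3".encode("latin-1")
chinese_name = "\u4e2d\u6587.mp3".encode("gbk")

ok = filename_to_utf8(latin1_name, "latin-1")
also_ok = filename_to_utf8(chinese_name, "gbk")

# Treating the Chinese bytes as Latin-1 raises no error, because every
# byte is a valid Latin-1 character, but the result is the wrong text.
garbled = filename_to_utf8(chinese_name, "latin-1")

print(ok.decode("utf-8"), also_ok.decode("utf-8"))
print(garbled.decode("utf-8") == also_ok.decode("utf-8"))
```

So assuming Latin-1 everywhere would fail silently: the program would 
emit well-formed UTF-8 that spells the wrong file name.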

> > I don't find information in the Gnat UG/RM about these things.
> 
> Look again, in the GNAT Users Guide for "Foreign Language Representation."

Correct me if I'm wrong, but doesn't that refer to source 
representation? (I had missed it anyway ^_^)

(Of course if my program were to be translated, that applies. I'm not so 
concerned about this but I should have been clearer).

As a final side note, my program is GUI-less, which is why I'm not 
concerned about translation. However, it has a SOAP interface. Through 
that I've plugged in a Java GUI which correctly decodes and shows my 
UTF-8 strings (a few traces and status reports).

Thanks,

-- 
-------------------------
Jano
402450.at.cepsz.unizar.es
-------------------------




* Re: Ada, Gnat and Unicode
  2003-10-23 17:38   ` Jano
@ 2003-10-23 21:54     ` Robert I. Eachus
  2003-10-24 15:09       ` Jano
  0 siblings, 1 reply; 7+ messages in thread
From: Robert I. Eachus @ 2003-10-23 21:54 UTC (permalink / raw)


Jano wrote:
> Robert I. Eachus dice...

> As you may be seeing now, I want to scan a folder and transform the 
> filenames into UTF8. That's fine for me which know that I'm getting 
> Latin1 encoded strings from the Directory_Operations package, and any 
> metadata entered by the user. But I was wondering what would happen to a 
> Chinese user (not that I foresee any usage of my program in wide 
> deployment, but when faced with the problem one *must* know ;)

Remember my advice about canonicalization.  If you get Unicode or UTF-8 
file names from the OS, they may or may not be in a canonical form.  If 
not, get the OS to do it for you.  And of course, this information is OS 
specific. You won't really care what the OS's definition of canonical 
form is, just whether the strings you are getting are in that form, and 
if not how to call the OS to do that.

>>Look again, in the GNAT Users Guide for "Foreign Language Representation."
>  
> Correct me, that refers to source representation? (I had missed it 
> anyway ^_^)

Yes, it refers to source representation, but if you think about it for a 
second, the source representation of non-Latin1 characters is an issue 
for Character and String literals.  Otherwise the compiler doesn't care 
what Character type you use in your program.

> (Of course if my program were to be translated, that applies. I'm not so 
> concerned about this but I should have been clearer).

-- 
                                                     Robert I. Eachus






* Re: Ada, Gnat and Unicode
  2003-10-23 14:48 Ada, Gnat and Unicode Jano
  2003-10-23 15:49 ` Robert I. Eachus
@ 2003-10-24  4:01 ` Steve
  2003-10-24 15:07   ` Jano
  1 sibling, 1 reply; 7+ messages in thread
From: Steve @ 2003-10-24  4:01 UTC (permalink / raw)


A good place to start looking is to download XML/Ada and have a look at the
Unicode part.  There appears to be extensive support there.

Steve
(The Duck)








* Re: Ada, Gnat and Unicode
  2003-10-24  4:01 ` Steve
@ 2003-10-24 15:07   ` Jano
  0 siblings, 0 replies; 7+ messages in thread
From: Jano @ 2003-10-24 15:07 UTC (permalink / raw)


Steve wrote...
> A good place to start looking is to download XML/Ada and have a look at the
> unicode part.  There appears to be extensive support there.

I'm already using it for both XML and Unicode purposes.




* Re: Ada, Gnat and Unicode
  2003-10-23 21:54     ` Robert I. Eachus
@ 2003-10-24 15:09       ` Jano
  0 siblings, 0 replies; 7+ messages in thread
From: Jano @ 2003-10-24 15:09 UTC (permalink / raw)


Robert I. Eachus wrote...
> Jano wrote:
> > Robert I. Eachus dice...
> 
> > As you may be seeing now, I want to scan a folder and transform the 
> > filenames into UTF8. That's fine for me which know that I'm getting 
> > Latin1 encoded strings from the Directory_Operations package, and any 
> > metadata entered by the user. But I was wondering what would happen to a 
> > Chinese user (not that I foresee any usage of my program in wide 
> > deployment, but when faced with the problem one *must* know ;)
> 
> Remember my advice about canonicalization.  If you get Unicode or UTF-8 
> file names from the OS, they may or may not be in a canonical form.  If 
> not, get the OS to do it for you.  And of course, this information is OS 
> specific. You won't really care what the OS's definition of canonical 
> form is, just whether the strings you are getting are in that form, and 
> if not how to call the OS to do that.

Ok, I see. In the end that's the outcome I didn't want to hear but the 
one I expected.

> Yes, it refers to source representation, but if you think about it for a 
> second, the source representation of non-Latin1 characters is an issue 
> for Character and String literals.  Otherwise the compiler doesn't care 
> what Character type you use in your program.

I was referring to that too :)

Thanks!

-- 
-------------------------
Jano
402450.at.cepsz.unizar.es
-------------------------



