Re: UNICODE - non-Asian

comp.lang.ada

From: dewar@merv.cs.nyu.edu (Robert Dewar)
Subject: Re: UNICODE - non-Asian
Date: 1998/05/23
Message-ID: <dewar.895928902@merv>
In-Reply-To: EACHUS.98May22161036@spectre.mitre.org

Robert Eachus quotes from the standard:

ISO/IEC 8859-1:1998 Latin 1  (Yes, that is 1998!)
ISO/IEC 8859-2:1987 Latin 2
ISO/IEC 8859-3:1988 Latin 3
ISO/IEC 8859-4:1988 Latin 4
ISO/IEC 8859-5:1988 Latin/Cyrillic
ISO/IEC 8859-6:1987 Latin/Arabic
ISO/IEC 8859-7:1987 Latin/Greek
ISO/IEC 8859-8:1988 Latin/Hebrew
ISO/IEC 8859-9:1989 Latin 5
ISO/IEC 8859-10:1992 Latin 6

Note that in practice an Ada compiler that supports Latin-1 can be used
perfectly well for any of these subparts of the standard. In response
to some input command you type in Latin/Arabic as 8-bit codes, and it
gets stored internally as some gobbledygook Latin-1 stuff. But since you
write your character and string literals with the same translation, everything
is fine.
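
The round trip described above can be shown with a small sketch. It is in Python rather than Ada only because Python's standard codecs include the ISO 8859 tables. The sample word and all variable names are assumptions for illustration, and it uses Latin/Greek where the text mentions Latin/Arabic; nothing here comes from GNAT.

```python
# Sketch: text typed on an ISO 8859-7 (Latin/Greek) terminal, handled by a
# program that believes every 8-bit code is Latin-1.

typed = "αβγ".encode("iso8859_7")   # the 8-bit codes the terminal sends

# A Latin-1 program stores those codes as Latin-1 characters: gobbledygook
# to a Greek reader, but nothing is lost.
stored = typed.decode("latin-1")

# A string literal in the source file goes through the same translation,
# since the source was also typed as ISO 8859-7 and compiled as Latin-1.
literal = "αβγ".encode("iso8859_7").decode("latin-1")

# Comparisons still work, and writing the codes back out unchanged gives
# the terminal exactly what it expects.
matches = stored == literal
round_trip = stored.encode("latin-1").decode("iso8859_7")

print(repr(stored), matches, round_trip)
```

The point is that the program never needs to know which part of ISO 8859 is in use, as long as input, literals, and output all pass through the same 8-bit codes.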

There are only two problems in practice:

The package Ada.Characters.Latin_1 is of limited use, e.g. its idea of
what counts as a letter is not useful. Of course you can write your own, or
perhaps your vendor will supply an analogous package.
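
To make the letter problem concrete, here is a hedged sketch in Python (chosen only because its standard codecs carry the ISO 8859 tables). It classifies one 8-bit code under two readings. Python's `str.isalpha` stands in for a Latin-1 letter test such as Ada.Characters.Handling.Is_Letter; that substitution is an assumption of the sketch, although both agree on this particular code.

```python
# Sketch: the same 8-bit code, classified under two different readings.
code = bytes([0xD7])

as_latin1 = code.decode("latin-1")       # MULTIPLICATION SIGN
as_cyrillic = code.decode("iso8859_5")   # CYRILLIC SMALL LETTER ZE

latin1_says_letter = as_latin1.isalpha()
cyrillic_says_letter = as_cyrillic.isalpha()

print(latin1_says_letter, cyrillic_says_letter)
```

A Latin-1 classification rejects the code as a symbol, while a Latin/Cyrillic user sees an ordinary lower-case letter.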

You can't use everything you think of as a letter in identifiers, and
upper/lower case equivalence may be peculiar (for example, it may treat two
"letters" that are quite distinct to you as the same in identifiers).
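
One instance of the peculiar case equivalence, sketched in Python under the assumption that identifiers are folded with Latin-1 case rules. The helper `fold_latin1` is hypothetical and only imitates that folding; it is not how any particular compiler is written.

```python
# Sketch: two distinct Latin/Arabic letters that Latin-1 case folding merges.
hamza = bytes([0xC1])   # ARABIC LETTER HAMZA in ISO 8859-6
feh = bytes([0xE1])     # ARABIC LETTER FEH in ISO 8859-6

# To the user these are two different letters.
distinct_to_user = hamza.decode("iso8859_6") != feh.decode("iso8859_6")

def fold_latin1(raw: bytes) -> str:
    """Hypothetical helper: read the codes as Latin-1 and fold to lower case."""
    return raw.decode("latin-1").lower()

# Latin-1 reads the same codes as A-acute and a-acute, which differ only in
# case, so identifiers that differ only in these two letters would collide.
same_identifier = fold_latin1(hamza) == fold_latin1(feh)

print(distinct_to_user, same_identifier)
```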

It may be that the vendor supplies non-standard modes in which other codes
than Latin-1 are recognized for identifiers, in which case you can write
(potentially non-portable) code taking advantage of this.

In the absence of such special non-standard modes, or if you are concerned
about writing portable code, then you can simply stick to the lower half
of the ISO definition, which is the same in most parts.
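
Checking that a source file stays in the lower half is easy to automate. The following is a minimal sketch in Python; the function name and the two sample source lines are made up for illustration.

```python
def upper_half_positions(source: bytes) -> list:
    """Return the offsets of all bytes that have the high bit set."""
    return [i for i, b in enumerate(source) if b >= 0x80]

# Hypothetical source lines: one pure 7-bit, one with Latin-1 letters.
portable = b"Total_Count : Integer := 0;"
risky = "Größe : Integer := 0;".encode("latin-1")

print(upper_half_positions(portable))
print(upper_half_positions(risky))
```

An empty result means the text reads the same under every part of ISO 8859 whose lower half agrees.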

In GNAT, we have not bothered to provide alternatives to the Latin_1
packages in the runtime. No one, not even a user of the public version,
has ever suggested that they wanted this, so the demand is close to zero.

We do provide non-standard modes for identifiers:

  1   Latin-1 identifiers
  2   Latin-2 letters allowed in identifiers
  3   Latin-3 letters allowed in identifiers
  4   Latin-4 letters allowed in identifiers
  p   IBM PC letters (code page 437) allowed in identifiers
  8   IBM PC letters (code page 850) allowed in identifiers
  f   Full upper-half codes allowed in identifiers
  n   No upper-half codes allowed in identifiers
  w   Wide-character codes allowed in identifiers

I put in the Latin-1/2/3/4 options one day when I had nothing else I felt
like doing. I doubt that any of them other than Latin-1 has ever been used.

I also put in the code page 437 PC stuff. A user commented that code page
850 would be useful in Europe and supplied the tables, so I put that in.
But I don't know if either has been used.

The full upper-half option is useful in China, and has been used at 
least once there.

The no-upper-half option is useful for ensuring portability.

The wide characters option is useful in Japan and has been used at
least a little bit there.

If anyone wants to supply additional tables for identifiers (see
csets.adb in the GNAT compiler sources), or additional alternative
packages for Ada.Characters.Latin_1, we could certainly include them.
I don't think this is the most urgent missing feature in GNAT :-)

By the way, I want to report that Markus Kuhn supplied the information
and a start towards the coding for recognizing UTF-8 in GNAT, and I have
just completed that coding, so GNAT will now fully support UTF-8. Thanks,
Markus, for this contribution!
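
For readers unfamiliar with the encoding, recognizing UTF-8 amounts to roughly the following. This is a generic sketch in Python, not the GNAT code: the function name is made up, only the 1- to 3-byte forms (enough for 16-bit codes) are handled, and overlong encodings are not rejected.

```python
def decode_utf8_bmp(data: bytes) -> list:
    """Decode UTF-8 into a list of code values (1- to 3-byte forms only)."""
    codes = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx
            value, extra = lead, 0
        elif 0xC0 <= lead < 0xE0:        # 110xxxxx 10xxxxxx
            value, extra = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:        # 1110xxxx 10xxxxxx 10xxxxxx
            value, extra = lead & 0x0F, 2
        else:                            # stray continuation or 4-byte form
            raise ValueError(f"unsupported lead byte {lead:#04x} at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1 : i + 1 + extra]:
            if b & 0xC0 != 0x80:         # continuation bytes are 10xxxxxx
                raise ValueError(f"bad continuation byte {b:#04x}")
            value = (value << 6) | (b & 0x3F)
        codes.append(value)
        i += extra + 1
    return codes

print([hex(c) for c in decode_utf8_bmp("Aé€".encode("utf-8"))])
```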

Robert Dewar
Ada Core Technologies

