UNICODE - non-Asian

comp.lang.ada

* UNICODE - non-Asian
@ 1998-05-20  0:00 William A Whitaker
  1998-05-20  0:00 ` Robert Dewar
  0 siblings, 1 reply; 10+ messages in thread
From: William A Whitaker @ 1998-05-20  0:00 UTC



If I can re-pulse the group on the original question.  My main problem
is Greek and Hebrew, not Japanese.  I agree that the Japanese are not
likely to favor UNICODE.

I get the impression that there is no applicable experience here.  If I
am wrong, please tell me.


Whitaker





* Re: UNICODE - non-Asian
  1998-05-20  0:00 UNICODE - non-Asian William A Whitaker
@ 1998-05-20  0:00 ` Robert Dewar
  1998-05-22  0:00   ` Robert I. Eachus
  0 siblings, 1 reply; 10+ messages in thread
From: Robert Dewar @ 1998-05-20  0:00 UTC



Bill, you asked:

<<If I can re-pulse the group on the original question.  My main problem
is Greek and Hebrew, not Japanese.  I agree that the Japanese are not
likely to favor UNICODE.

I get the impression that there is no applicable experience here.  If I
am wrong, please tell me.
>>

Greek uses an 8-bit code. It is one of the family of 8-bit codes of
which Latin-1 is an example. Generally any compiler will support
use in a Greek environment without much fiddling. In addition,
GNAT provides the option of using Latin-1/Latin-2/Latin-3/Latin-4
as well as the IBM PC set (both code pages 437 and 850) for identifiers.
This is a non-standard feature, although this kind of non-standard
capability is very much anticipated by RM 3.5.2(4):

                         Implementation Permissions

4   In a nonstandard mode, an implementation may provide other
interpretations for the predefined types Character and Wide_Character, to
conform to local conventions.

The main effect of selecting one of these options in GNAT (they are fully
documented in the GNAT documentation) is that you get proper recognition
of the full set of "letters" with proper upper/lower case equivalence.

I am not sure what encodings are standard for Hebrew; someone here should
know. I am a little surprised if we don't support it already, since we
have a number of customers in Israel, and this subject has not come up :-)






* Re: UNICODE - non-Asian
  1998-05-20  0:00 ` Robert Dewar
@ 1998-05-22  0:00   ` Robert I. Eachus
  1998-05-22  0:00     ` Markus Kuhn
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Robert I. Eachus @ 1998-05-22  0:00 UTC

In article <dewar.895708219@merv> dewar@merv.cs.nyu.edu (Robert Dewar) writes:

 > I am not sure what encodings are standard for Hebrew; someone here should
 > know. I am a little surprised if we don't support it already, since we
 > have a number of customers in Israel, and this subject has not come up :-)

  Latin/Hebrew ISO/IEC 8859-8 of course.  There are ten defined
parts of 8859.  Latin-1 is the first, covering almost all of
Western Europe, while Latin-2 is used by some central European
countries, etc.  Four of the parts cover other scripts:
Latin/Cyrillic, Latin/Greek, Latin/Arabic, and Latin/Hebrew.

  The "parts" of 8859 and their names:

ISO/IEC 8859-1:1998 Latin 1  (Yes, that is 1998!)
ISO/IEC 8859-2:1987 Latin 2  
ISO/IEC 8859-3:1988 Latin 3  
ISO/IEC 8859-4:1988 Latin 4
ISO/IEC 8859-5:1988 Latin/Cyrillic
ISO/IEC 8859-6:1987 Latin/Arabic
ISO/IEC 8859-7:1987 Latin/Greek
ISO/IEC 8859-8:1988 Latin/Hebrew
ISO/IEC 8859-9:1989 Latin 5
ISO/IEC 8859-10:1992 Latin 6
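
To make the relationship between the parts concrete, here is a rough
sketch in Python (not taken from the standard or from any Ada compiler).
It decodes one lower-half byte and one upper-half byte under several
parts.  The codec names are Python's standard ones; the table PARTS and
the helper decode_under_parts are made up for this illustration.

```python
# Python codec names for a few parts of ISO/IEC 8859 (illustrative subset).
PARTS = {
    "Latin 1": "iso8859_1",
    "Latin/Cyrillic": "iso8859_5",
    "Latin/Arabic": "iso8859_6",
    "Latin/Greek": "iso8859_7",
    "Latin/Hebrew": "iso8859_8",
}


def decode_under_parts(byte_value):
    """Decode a single byte under each part above (hypothetical helper)."""
    raw = bytes([byte_value])
    return {name: raw.decode(codec) for name, codec in PARTS.items()}


# 0x41 lies in the lower half, so every part reads it as the letter 'A'.
lower = decode_under_parts(0x41)

# 0xE1 lies in the upper half, so each part assigns it a different letter.
upper = decode_under_parts(0xE1)

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal's own character set.
for name in PARTS:
    print(f"{name:15} 0x41 -> U+{ord(lower[name]):04X}   "
          f"0xE1 -> U+{ord(upper[name]):04X}")
```

The byte 0x41 comes out as U+0041 under every part, while 0xE1 comes out
as a different letter each time (a-acute in Latin 1, alpha in
Latin/Greek, bet in Latin/Hebrew, and so on).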

    Currently, there are three new parts under DIS balloting: 13,
14, and 15.  They are Latin 7, Latin 8 (Celtic), and Latin 0,
respectively.  Parts 2 through 10 are also currently being revised,
but I think that these revisions, and the recent revision to Latin-1,
were to bring the documents up to date without significant changes.

   Incidentally, ISO 10646-1:1993 (Basic Multilingual Plane/Unicode)
now has two corrigenda and 19 amendments.  Aren't standards wonderful!

  Trivia question: What English letters were removed from Latin-1 in
the original 1987 version?  

  Real Trivia question: What English letter is not in Unicode?

--

					Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...


* Re: UNICODE - non-Asian
  1998-05-22  0:00   ` Robert I. Eachus
@ 1998-05-22  0:00     ` Markus Kuhn
  1998-05-25  0:00       ` Samuel Tardieu
  1998-05-26  0:00       ` Robert I. Eachus
  1998-05-23  0:00     ` Robert Dewar
  1998-06-01  0:00     ` Norman H. Cohen
  2 siblings, 2 replies; 10+ messages in thread
From: Markus Kuhn @ 1998-05-22  0:00 UTC



Robert I. Eachus wrote:
>   Trivia question: What English letters were removed from Latin-1 in
> the original 1987 version?

The oe and OE ligature (where we have now × and ÷)?

>   Real Trivia question: What English letter is not in Unicode?

The copyleft sign?

Markus

-- 
Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
email: mkuhn at acm.org,  home page: <http://www.cl.cam.ac.uk/~mgk25/>





* Re: UNICODE - non-Asian
  1998-05-22  0:00     ` Markus Kuhn
@ 1998-05-25  0:00       ` Samuel Tardieu
  1998-05-26  0:00       ` Robert I. Eachus
  1 sibling, 0 replies; 10+ messages in thread
From: Samuel Tardieu @ 1998-05-25  0:00 UTC



>>>>> "Markus" == Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:

Markus> The oe and OE ligature (where we have now × and ÷)?

Fortunately, these ligatures (well, not everyone agrees that they are
ligatures in French) will be included in the forthcoming Latin-0.

  Sam
-- 
Samuel Tardieu -- sam@ada.eu.org





* Re: UNICODE - non-Asian
  1998-05-22  0:00     ` Markus Kuhn
  1998-05-25  0:00       ` Samuel Tardieu
@ 1998-05-26  0:00       ` Robert I. Eachus
  1 sibling, 0 replies; 10+ messages in thread
From: Robert I. Eachus @ 1998-05-26  0:00 UTC




In article <356603AC.79DB5014@cl.cam.ac.uk> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:

   > The oe and OE ligature (where we have now × and ÷)?

   Yep!

   >   Real Trivia question: What English letter is not in Unicode?

   > The copyleft sign?

   No, oe with diaeresis (two dots) over the o.  It appears very
rarely, but in one case, a village in Brittany, it appears with the O
capitalized, and the e lower case.  (When AE or OE appear as the first
letter in a capitalized English word, it is always the case that both
are capitalized.)
--

					Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...





* Re: UNICODE - non-Asian
  1998-05-22  0:00   ` Robert I. Eachus
  1998-05-22  0:00     ` Markus Kuhn
@ 1998-05-23  0:00     ` Robert Dewar
  1998-05-24  0:00       ` Ronald Cole
  1998-06-01  0:00     ` Norman H. Cohen
  2 siblings, 1 reply; 10+ messages in thread
From: Robert Dewar @ 1998-05-23  0:00 UTC



Robert Eachus quotes from the standard:

ISO/IEC 8859-1:1998 Latin 1  (Yes, that is 1998!)
ISO/IEC 8859-2:1987 Latin 2
ISO/IEC 8859-3:1988 Latin 3
ISO/IEC 8859-4:1988 Latin 4
ISO/IEC 8859-5:1988 Latin/Cyrillic
ISO/IEC 8859-6:1987 Latin/Arabic
ISO/IEC 8859-7:1987 Latin/Greek
ISO/IEC 8859-8:1988 Latin/Hebrew
ISO/IEC 8859-9:1989 Latin 5
ISO/IEC 8859-10:1992 Latin 6


Note that in practice an Ada compiler that supports Latin-1 can be used
perfectly well for any of these subparts of the standard. In response
to some input command you type in Latin/Arabic as 8-bit codes, and it
gets stored internally as some gobbledygook Latin-1 stuff. But since you
write your character and string literals with the same translation, everything
is fine.
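
The round trip described above can be sketched in Python (this is only
an illustration of the byte-level argument, not of how any Ada compiler
is implemented; the variable names are made up).  Latin-1 assigns a
character to every byte value, so reading "foreign" 8-bit text as
Latin-1 and writing it back loses nothing.

```python
# Text as a Latin/Arabic (ISO 8859-6) editor would store it in a file.
arabic_text = "\u0633\u0644\u0627\u0645"
file_bytes = arabic_text.encode("iso8859_6")

# A Latin-1-only tool reads the same bytes and sees unrelated characters...
as_latin_1 = file_bytes.decode("iso8859_1")

# ...but writing them back out reproduces the original bytes exactly,
# because Latin-1 maps every byte value to exactly one character.
written_back = as_latin_1.encode("iso8859_1")

# Viewed again in the Latin/Arabic environment, the text is intact.
round_tripped = written_back.decode("iso8859_6")

print(file_bytes.hex(), [f"U+{ord(c):04X}" for c in as_latin_1])
print(round_tripped == arabic_text)
```

The intermediate Latin-1 string is the "gobbledygook" (accented Latin
letters), but the bytes, and hence the text seen in the Arabic
environment, are unchanged.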

There are only two problems in practice:

The package Ada.Characters.Latin_1 is of limited use, e.g. its idea of
what counts as a letter is not useful. Of course you can write your own, or
perhaps your vendor will supply an analogous package.

You can't use everything you think of as a letter in identifiers, and
upper/lower case equivalence may be peculiar (for example, two "letters" that
are quite distinct to you may be treated as the same in identifiers).
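
Both problems can be seen in a small Python sketch.  The assumption,
stated loudly, is that Python's own Unicode tables stand in for a
compiler's Latin-1 letter and case tables; the two helper functions are
hypothetical, and the behaviour of any particular Ada compiler is not
shown here.

```python
def folds_together_in_latin_1(a, b):
    """True if two byte values are the same letter under Latin-1 case
    folding (hypothetical helper; str.lower stands in for the table)."""
    first = bytes([a]).decode("iso8859_1")
    second = bytes([b]).decode("iso8859_1")
    return first.lower() == second.lower()


def is_letter_in_latin_1(byte_value):
    """True if Latin-1 classifies this byte value as a letter."""
    return bytes([byte_value]).decode("iso8859_1").isalpha()


# Two distinct Cyrillic letters, as bytes in ISO 8859-5.
be = "\u0431".encode("iso8859_5")[0]   # CYRILLIC SMALL LETTER BE
io = "\u0451".encode("iso8859_5")[0]   # CYRILLIC SMALL LETTER IO

# A Greek letter, as a byte in ISO 8859-7.
chi = "\u03c7".encode("iso8859_7")[0]  # GREEK SMALL LETTER CHI

print(hex(be), hex(io), folds_together_in_latin_1(be, io))
print(hex(chi), is_letter_in_latin_1(chi))
```

Under Latin-1 rules the two Cyrillic bytes are N-tilde and n-tilde, so
they fold to the same letter, while the byte for Greek chi is the
division sign and is not a letter at all.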

It may be that the vendor supplies non-standard modes in which other codes
than Latin-1 are recognized for identifiers, in which case you can write
(potentially non-portable) code taking advantage of this.

In the absence of such special non-standard modes, or if you are concerned
about writing portable code, then you can simply stick to the lower half
of the ISO definition, which is the same in most parts.
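
One simple way to check that source text stays in the lower half is to
look for bytes above 0x7F.  The following Python sketch does that; the
function name and the two sample lines are invented for the example.

```python
def upper_half_positions(source_bytes):
    """Return the offsets of all bytes above 0x7F (hypothetical helper)."""
    return [i for i, b in enumerate(source_bytes) if b > 0x7F]


# A declaration that uses only lower-half characters.
portable = b"Total_Count : Integer := 0;"

# The same kind of declaration with a two-letter Greek name (alpha, beta),
# stored as ISO 8859-7 bytes.
greek_name = "\u03b1\u03b2 : Integer := 0;".encode("iso8859_7")

print(upper_half_positions(portable))
print(upper_half_positions(greek_name))
```

An empty list means the line reads the same under any part; a non-empty
list gives the offsets of the bytes whose meaning depends on the part in
use.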

In GNAT, we have not bothered to provide alternatives to the Latin_1
packages in the runtime. No one, not even a user of the public version,
has ever suggested that they wanted this, so the demand is close to zero.

We do provide non-standard modes for identifiers:

  1   Latin-1 identifiers
  2   Latin-2 letters allowed in identifiers
  3   Latin-3 letters allowed in identifiers
  4   Latin-4 letters allowed in identifiers
  p   IBM PC letters (code page 437) allowed in identifiers
  8   IBM PC letters (code page 850) allowed in identifiers
  f   Full upper-half codes allowed in identifiers
  n   No upper-half codes allowed in identifiers
  w   Wide-character codes allowed in identifiers



I put in the Latin-1/2/3/4 options one day when I had nothing else I felt
like doing. I doubt that any of them other than Latin-1 has ever been used.

I also put in the code page 437 PC stuff. A user commented that code page
850 would be useful in Europe and supplied the tables, so I put that in.
But I don't know if either has been used.

The full upper-half option is useful in China, and has been used at 
least once there.

The no-upper-half option is useful for ensuring portability.

The wide-character option is useful in Japan and has been used at
least a little bit there.

If anyone wants to supply additional tables for identifiers (see
csets.adb in the GNAT compiler sources), or additional alternative
packages for Ada.Characters.Latin_1, we could certainly include them.
I don't think this is the most urgent missing feature in GNAT :-)

By the way, I want to report that Markus Kuhn supplied the information
and a start towards the coding for recognizing UTF-8 in GNAT, and I have
just completed that coding, so GNAT will now fully support UTF-8. Thanks,
Markus, for this contribution!

Robert Dewar
Ada Core Technologies






* Re: UNICODE - non-Asian
  1998-05-23  0:00     ` Robert Dewar
@ 1998-05-24  0:00       ` Ronald Cole
  1998-05-25  0:00         ` Robert Dewar
  0 siblings, 1 reply; 10+ messages in thread
From: Ronald Cole @ 1998-05-24  0:00 UTC



dewar@merv.cs.nyu.edu (Robert Dewar) writes:
> The full upper-half option is useful in China, and has been used at 
> least once there.

Would that be in the code controlling the guidance systems on the
thirteen nukes they've purportedly pointed at the US?  ;)

-- 
Forte International, P.O. Box 1412, Ridgecrest, CA  93556-1412
Ronald Cole <ronald@forte-intl.com>      Phone: (760) 499-9142
President, CEO                             Fax: (760) 499-9152
My PGP fingerprint: E9 A8 E3 68 61 88 EF 43  56 2B CE 3E E9 8F 3F 2B





* Re: UNICODE - non-Asian
  1998-05-24  0:00       ` Ronald Cole
@ 1998-05-25  0:00         ` Robert Dewar
  0 siblings, 0 replies; 10+ messages in thread
From: Robert Dewar @ 1998-05-25  0:00 UTC



Ronald Cole asks:

<<> The full upper-half option is useful in China, and has been used at
> least once there.

Would that be in the code controlling the guidance systems on the
thirteen nukes they've purportedly pointed at the US?  ;)
>>

Not unless these systems are run on PCs using Windows 95, which seems unlikely
despite Microsoft's interest in being the supplier of everyone's operating
system.






* Re: UNICODE - non-Asian
  1998-05-22  0:00   ` Robert I. Eachus
  1998-05-22  0:00     ` Markus Kuhn
  1998-05-23  0:00     ` Robert Dewar
@ 1998-06-01  0:00     ` Norman H. Cohen
  2 siblings, 0 replies; 10+ messages in thread
From: Norman H. Cohen @ 1998-06-01  0:00 UTC



Robert I. Eachus wrote:
> 
> In article <dewar.895708219@merv> dewar@merv.cs.nyu.edu (Robert Dewar) writes:
> 
>  > I am not sure what encodings are standard for Hebrew; someone here should
>  > know. I am a little surprised if we don't support it already, since we
>  > have a number of customers in Israel, and this subject has not come up :-)
> 
>   Latin/Hebrew ISO/IEC 8859-8 of course.

That is correct.  It is in common use, for example, in Hebrew web pages.

(I must say, however, that I find Bill Whitaker's characterization of
Hebrew as a "non-Asian" language most curious!  :-) )

-- 
Norman H. Cohen
mailto:ncohen@watson.ibm.com
http://www.research.ibm.com/people/n/ncohen





end of thread, other threads:[~1998-06-01  0:00 UTC | newest]

Thread overview: 10+ messages
1998-05-20  0:00 UNICODE - non-Asian William A Whitaker
1998-05-20  0:00 ` Robert Dewar
1998-05-22  0:00   ` Robert I. Eachus
1998-05-22  0:00     ` Markus Kuhn
1998-05-25  0:00       ` Samuel Tardieu
1998-05-26  0:00       ` Robert I. Eachus
1998-05-23  0:00     ` Robert Dewar
1998-05-24  0:00       ` Ronald Cole
1998-05-25  0:00         ` Robert Dewar
1998-06-01  0:00     ` Norman H. Cohen
