comp.lang.ada
From: dvdeug@x8b4e53cd.dhcp.okstate.edu (David Starner)
Subject: Re: Hebrew language character set
Date: 5 Apr 2001 20:43:08 GMT
Message-ID: <9ailcs$8a22@news.cis.okstate.edu>
In-Reply-To: 3ACCB8F8.C7153457@adapower.net

On Thu, 05 Apr 2001 13:27:04 -0500, Britt Snodgrass <britt@adapower.net> wrote:
>The URL you used points to a very old/obsolete version of the GNAT
>reference.  See the "Wide Character Encodings" section of a GNAT 3.13p
>or 3.14a users manual. 
[...]
>From the GNAT 3.14a Users Guide:
>
>Wide Character Encodings
>
>GNAT allows wide character codes to appear in character and string
>literals, and also optionally in identifiers, by means of the following
>possible encoding schemes: 

"in ... literals, and ... in identifiers" clearly shows that this 
section of the GNAT User Guide is talking about the representation
of the source, not I/O. Try the Wide_Text_IO section instead:

Wide_Text_IO
============

`Wide_Text_IO' is similar in most respects to Text_IO, except that both
input and output files may contain special sequences that represent
wide character values. The encoding scheme for a given file may be
specified using a FORM parameter:

     WCEM=X

as part of the FORM string (WCEM = wide character encoding method),
where X is one of the following characters:

`h'
     Hex ESC encoding

`u'
     Upper half encoding

`s'
     Shift-JIS encoding

`e'
     EUC Encoding

`8'
     UTF-8 encoding

`b'
     Brackets encoding

   The encoding methods match those that can be used in a source
program, but there is no requirement that the encoding method used for
the source program be the same as the encoding method used for files,
and different files may use different encoding methods.

   The default encoding method for the standard files, and for opened
files for which no WCEM parameter is given in the FORM string, matches
the wide character encoding specified for the main program (the default
being brackets encoding if no coding method was specified with -gnatW).
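
As a quick reference, the one-character codes above can be put in a
lookup table. This is only a sketch in Python, not anything GNAT
provides; the names WCEM_METHODS and wcem_form are made up for the
illustration. The resulting string is what would be passed as the FORM
parameter when opening or creating a Wide_Text_IO file.

```python
# Hypothetical helper, not part of GNAT: maps the WCEM codes listed in
# the manual excerpt to their names and builds a FORM string fragment.
WCEM_METHODS = {
    "h": "Hex ESC encoding",
    "u": "Upper half encoding",
    "s": "Shift-JIS encoding",
    "e": "EUC encoding",
    "8": "UTF-8 encoding",
    "b": "Brackets encoding",
}


def wcem_form(code):
    """Return 'WCEM=X' for a known encoding code, else raise ValueError."""
    if code not in WCEM_METHODS:
        raise ValueError("unknown WCEM code: %r" % (code,))
    return "WCEM=" + code
```

For example, wcem_form("8") yields the string WCEM=8, which selects
UTF-8 for that one file regardless of the -gnatW setting.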

Hex Coding
     In this encoding, a wide character is represented by a five
     character sequence:

          ESC a b c d

     where a, b, c, d are the four hexadecimal digits (using upper
     case letters) of the wide character code. For example, ESC A345 is
     used to represent the wide character with code 16#A345#. This
     scheme is compatible with use of the full `Wide_Character' set.
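
To make the layout concrete, here is a minimal sketch in Python (not
GNAT code). It assumes that codes up to 16#FF# are passed through as
single characters and that only higher codes get the ESC form; the
excerpt does not say this explicitly. The function names are invented
for the example.

```python
ESC = "\x1b"  # the ASCII escape character
HEX_DIGITS = "0123456789ABCDEF"


def hex_esc_encode(text):
    """Encode 16-bit characters as ESC followed by four hex digits."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("code does not fit in 16 bits")
        # Assumption: codes up to 16#FF# are written unchanged.
        out.append(ESC + format(code, "04X") if code > 0xFF else ch)
    return "".join(out)


def hex_esc_decode(data):
    """Reverse hex_esc_encode; raise ValueError on a bad sequence."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            digits = data[i + 1:i + 5]
            if len(digits) != 4 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("invalid Hex ESC sequence")
            out.append(chr(int(digits, 16)))
            i += 5
        else:
            out.append(data[i])
            i += 1
    return "".join(out)
```

With this sketch, the Hebrew letter alef (16#05D0#) comes out as ESC
followed by the four characters 05D0.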

Upper Half Coding
     The wide character with encoding 16#abcd#, where the upper bit is
     on (i.e. a is in the range 8-F) is represented as two bytes 16#ab#
     and 16#cd#. The second byte may never be a format control
     character, but is not required to be in the upper half. This
     method can be also used for shift-JIS or EUC where the internal
     coding matches the external coding.
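
A small Python sketch of this scheme follows (again, not GNAT code).
It assumes that codes below 16#80# are written as single bytes and
that everything else without the upper bit set is rejected. It does
not check the rule about format control characters in the second
byte. The function names are invented for the example.

```python
def upper_half_encode(text):
    """Encode characters as one ASCII byte or two bytes 16#ab# 16#cd#."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            out.append(code)           # assumption: plain ASCII byte
        elif 0x8000 <= code <= 0xFFFF:
            out.append(code >> 8)      # first byte, always in the upper half
            out.append(code & 0xFF)    # second byte, may be in either half
        else:
            raise ValueError("not representable: 16#%04X#" % code)
    return bytes(out)


def upper_half_decode(data):
    """Decode bytes; an upper-half byte starts a two-byte sequence."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            out.append(chr(first))
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence")
            out.append(chr((first << 8) | data[i + 1]))
            i += 2
    return "".join(out)
```

Note that under these assumptions a Hebrew letter such as alef
(16#05D0#) cannot be written at all, since its upper bit is off.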

Shift JIS Coding
     A wide character is represented by a two character sequence 16#ab#
     and 16#cd#, with the restrictions described above for upper half
     encoding. The internal character code is the
     corresponding JIS character according to the standard algorithm
     for Shift-JIS conversion. Only characters defined in the JIS code
     set table can be used with this encoding method.

EUC Coding
     A wide character is represented by a two character sequence 16#ab#
     and 16#cd#, with both characters being in the upper half. The
     internal character code is the corresponding JIS character
     according to the EUC encoding algorithm. Only characters defined
     in the JIS code set table can be used with this encoding method.
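
The relation between the internal JIS code and the two external forms
can be sketched in Python as below. This assumes that "JIS character"
means the two-byte JIS X 0208 code (each byte in 16#21#..16#7E#) and
uses the commonly published conversion arithmetic; it is not taken
from the GNAT sources and ignores half-width katakana. The function
names are invented for the example.

```python
def jis_to_euc(jis):
    """JIS X 0208 code -> two EUC bytes (upper bit set on each byte)."""
    return bytes([(jis >> 8) | 0x80, (jis & 0xFF) | 0x80])


def euc_to_jis(b1, b2):
    """Two EUC bytes -> JIS X 0208 code (upper bit cleared on each byte)."""
    return ((b1 & 0x7F) << 8) | (b2 & 0x7F)


def jis_to_shift_jis(jis):
    """JIS X 0208 code -> two Shift-JIS bytes."""
    j1, j2 = jis >> 8, jis & 0xFF
    # Two JIS rows share one Shift-JIS lead byte.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd row: lower block of trail bytes, skipping 16#7F#.
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        # Even row: upper block of trail bytes.
        s2 = j2 + 0x7E
    return bytes([s1, s2])
```

For hiragana "a" (JIS code 16#2422#) this gives the EUC bytes 16#A4#
16#A2# and the Shift-JIS bytes 16#82# 16#A0#.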

UTF-8 Coding
     A wide character is represented using UCS Transformation Format 8
     (UTF-8) as defined in Annex R of ISO 10646-1/Am.2.  Depending on
     the character value, the representation is a one, two, or three
     byte sequence:

          16#0000#-16#007f#: 2#0xxxxxxx#
          16#0080#-16#07ff#: 2#110xxxxx# 2#10xxxxxx#
          16#0800#-16#ffff#: 2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#

     where the xxx bits correspond to the left-padded bits of the
     16-bit character value. Note that all lower half ASCII characters
     are represented as ASCII bytes and all upper half characters and
     other wide characters are represented as sequences of upper-half
     bytes. (The full UTF-8 scheme allows for encoding 31-bit characters
     as 6-byte sequences, but in this implementation, all UTF-8
     sequences of four or more bytes in length will raise a
     Constraint_Error, as will all illegal UTF-8 sequences.)
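
The three bit patterns above can be written out directly. The Python
sketch below is an illustration only, with invented function names; it
raises ValueError where the text says Constraint_Error, and it does
not reject overlong forms, which a strict decoder would.

```python
def utf8_encode_16(code):
    """Encode one 16-bit character value as 1, 2 or 3 UTF-8 bytes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code does not fit in 16 bits")
    if code <= 0x7F:
        return bytes([code])
    if code <= 0x7FF:
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def utf8_decode_16(data):
    """Decode bytes to a list of 16-bit codes; reject longer sequences."""
    codes = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, code = 0, lead
        elif 0xC0 <= lead < 0xE0:
            extra, code = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            extra, code = 2, lead & 0x0F
        else:
            # Stray continuation byte, or a lead byte for 4+ bytes.
            raise ValueError("sequence not allowed for 16-bit characters")
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError("truncated or malformed sequence")
        for t in tail:
            code = (code << 6) | (t & 0x3F)
        codes.append(code)
        i += 1 + extra
    return codes
```

For example, the Hebrew letter alef (16#05D0#) falls in the two-byte
range and is encoded as 16#D7# 16#90#.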

Brackets Coding
     In this encoding, a wide character is represented by the following
     eight character sequence:

          [ " a b c d " ]

     where `a', `b', `c', `d' are the four hexadecimal characters
     (using uppercase letters) of the wide character code. For example,
     `["A345"]' is used to represent the wide character with code
     `16#A345#'.  This scheme is compatible with use of the full
     Wide_Character set.  On input, brackets coding can also be used
     for upper half characters, e.g. `["C1"]' for the Latin-1 character
     with code `16#C1#' (capital A with acute accent).
     However, on output, brackets notation is only used for wide
     characters with a code greater than `16#FF#'.
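
The asymmetry between input and output is easy to show in a short
Python sketch (illustration only, invented function names): the
encoder uses brackets only above 16#FF#, while the decoder also
accepts the two-digit form.

```python
import re

# Matches ["XXXX"] or ["XX"] with upper case hexadecimal digits.
_BRACKETS = re.compile(r'\["([0-9A-F]{4}|[0-9A-F]{2})"\]')


def brackets_encode(text):
    """Write codes above 16#FF# as ["XXXX"]; leave the rest unchanged."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("code does not fit in 16 bits")
        out.append('["%04X"]' % code if code > 0xFF else ch)
    return "".join(out)


def brackets_decode(data):
    """Replace every ["XX"] or ["XXXX"] sequence by its character."""
    return _BRACKETS.sub(lambda m: chr(int(m.group(1), 16)), data)
```

Because the output is plain ASCII apart from any upper half
characters, a file written this way stays readable in any editor,
which is presumably why it is the default.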

   For the coding schemes other than Hex and Brackets encoding, not all
wide character values can be represented. An attempt to output a
character that cannot be represented using the encoding scheme for the
file causes Constraint_Error to be raised. An invalid wide character
sequence on input also causes Constraint_Error to be raised.

[...]
======================================================================

(I must, however, apologize to the person who used -gnatW8 to change the
output, when I claimed it only changed source encoding. It's clear I
didn't read this section well enough, because, for better or worse,
it also changes the default encoding for files. It must get interesting
when the program is compiled with different default encodings for each
file, though ...)

-- 
David Starner - dstarner98@aasaa.ofe.org
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg


