Wide_String, Chinese & Japanese text files

comp.lang.ada

* Wide_String, Chinese & Japanese text files
@ 1999-08-20  0:00 Thierry Lelegard
  1999-08-21  0:00 ` Robert Dewar
  0 siblings, 1 reply; 6+ messages in thread
From: Thierry Lelegard @ 1999-08-20  0:00 UTC (permalink / raw)


Hello,

We are going to need to process, in Ada 95, text files
containing Chinese and Japanese messages (for i18n purposes).
I have absolutely no experience in handling this kind of file.
I do not even know the usual format of that kind of text
file (8 or 16 bits/char).

Due to the number of possible combinations, I assume
that at least some of these languages require 16 bits per
character.

So, before appointing people to write the messages, I
have one request and one question.

1) Could anyone e-mail me a typical example of a Chinese or
Japanese text file using 16-bit characters, preferably from
both the UNIX and Windows worlds, in case there are
incompatibilities such as the traditional LF vs CR/LF?

2) How could I handle this in Ada? I naively thought
that Ada.Wide_Text_IO would read 16 bits per character.
However (at least with GNAT), it writes and reads 8-bit
characters with "brackets coding" (as in Wide_String
literals). Of course, Sequential_IO on Wide_Character
or some kind of Stream_IO could do the trick, but I
wonder if there is some "standard" or at least "usual"
way to deal with this.

I should point out that we do not need to handle 16-bit
Ada source files, simply text files containing messages.
We already have the combined String/Wide_String support
in our applications, we simply need to choose the best
way to get the data from a file to a Wide_String. Then,
the Wide_Strings will be sent in SNMP messages.

Thank you all in advance.

-Thierry
________________________________________________________
Thierry Lelegard, Paris, France
E-mail: lelegard@club-internet.fr







* Re: Wide_String, Chinese & Japanese text files
  1999-08-20  0:00 Wide_String, Chinese & Japanese text files Thierry Lelegard
@ 1999-08-21  0:00 ` Robert Dewar
  1999-08-21  0:00   ` Thierry Lelegard
  0 siblings, 1 reply; 6+ messages in thread
From: Robert Dewar @ 1999-08-21  0:00 UTC (permalink / raw)


In article <7pka2j$lnn$1@front2.grolier.fr>,
  "Thierry Lelegard" <lelegard@club-internet.fr> wrote:
> We are going to need to process, in Ada 95, text files
> containing Chinese and Japanese messages (for i18n purposes).
> I have absolutely no experience in handling this kind of
> file.
> I do not even know the usual format of that kind of text
> file (8 or 16 bits/char).

There is no "usual format", there are many possible formats.
GNAT supports:

    Upper half coding (sometimes used for Chinese, never as
    far as I know for Japanese)

    Shift JIS coding (a common Japanese convention)

    EUC coding (another common Japanese convention)

    UTF-8 coding (an ISO standard; I have never seen it used in
    practice, but it probably is, and will be more so over time)

    Brackets coding (a portable ASCII coding, primarily useful
    for standard texts, e.g. the ACVC tests, not used for real
    data information interchange)

    ESC coding (another very simple portable ASCII coding, using
    an ESC character instead of brackets; again, not used for
    real data information interchange)
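
To make the difference between these codings concrete, here is a
small sketch that writes one sample string several ways. It is
in Python rather than Ada, and it is not how GNAT implements any
of this. The codec names are those of the Python standard library.
The brackets helper is hypothetical: it only imitates the ["hhhh"]
notation of Wide_String literals, always uses four hex digits, and
assumes characters that fit in 16 bits.

```python
# Sketch only: one text, several byte-level codings.
SAMPLE = "\u65e5\u672c\u8a9e"  # three kanji meaning "Japanese language"


def brackets(text):
    """Hypothetical helper: write each non-ASCII character as ["hhhh"]."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                   # plain ASCII passes through
        else:
            out.append('["%04X"]' % code)    # hex value of the character
    return "".join(out)


def encodings(text):
    """Return the same text as bytes under several codings."""
    return {
        "shift_jis": text.encode("shift_jis"),
        "euc_jp": text.encode("euc_jp"),
        "utf-8": text.encode("utf-8"),
        "brackets": brackets(text).encode("ascii"),
    }


if __name__ == "__main__":
    for name, data in encodings(SAMPLE).items():
        print(name, len(data), "bytes:", data.hex())
```

The same three characters come out as a different byte sequence
under each coding, which is why a file cannot be read correctly
without knowing which one was used to write it.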

> Due to the number of possible combinations, I assume
> that at least some of these languages require 16 bits per
> character.

You really need to familiarize yourself with the relevant
ISO standard and with Unicode. Yes, of course 16 bits
are required (in fact, for full Chinese support, 32 bits
are required; Unicode supports only a subset of Chinese).

> So, before appointing people to write the messages, I
> have one request and one question.
>
> 1) Could anyone e-mail me a typical example of a Chinese or
> Japanese text file using 16-bit characters, preferably from
> both the UNIX and Windows worlds, in case there are
> incompatibilities such as the traditional LF vs CR/LF?

There is no such thing. You really must find out the source
encoding, since there are multiple possibilities.
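
As a rough illustration of why the encoding has to be known rather
than guessed, the following Python sketch tries a handful of
candidate codecs on some bytes and reports which ones decode
without error. The candidate list and the function name are
assumptions made for this example only. A successful decode does
not prove that the guess is right, since several codings can accept
the same bytes and produce different text.

```python
# Sketch only: trial decoding narrows the choice, it does not settle it.
CANDIDATES = ("ascii", "utf-8", "shift_jis", "euc_jp", "gb2312", "big5")


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate codecs under which `data` decodes cleanly."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue          # these bytes are invalid in this coding
        ok.append(name)
    return ok


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e".encode("euc_jp")
    print(plausible_encodings(sample))
```

In practice the reliable answer comes from whoever produced the
file, not from inspecting the bytes.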

> 2) How could I handle this in Ada? I naively thought
> that Ada.Wide_Text_IO would read 16 bits per character.
> However (at least with GNAT), it writes and reads 8-bit
> characters with "brackets coding" (as in Wide_String
> literals). Of course, Sequential_IO on Wide_Character
> or some kind of Stream_IO could do the trick, but I
> wonder if there is some "standard" or at least "usual"
> way to deal with this.

You really need to know much more than you do to be successful
here. I strongly suggest you contact your vendor for assistance.
In the case of GNAT, especially for a Japanese environment we
can give a lot of help to GNAT Professional users, since our
Japanese Distributor (Jun Shimura) has extensive technical
experience in this area (in fact we worked closely with him
to ensure that our implementations of Shift-JIS and EUC were
correct).

If your compiler supports only the brackets encoding standard,
it is probably useless for your purposes, and you should look
around for a compiler that supports the format in which your
messages will be processed.

> I should point out that we do not need to handle 16-bit
> Ada source files, simply text files containing messages.
> We already have the combined String/Wide_String support
> in our applications, we simply need to choose the best
> way to get the data from a file to a Wide_String. Then,
> the Wide_Strings will be sent in SNMP messages.

Robert Dewar
Ada Core Technologies







* Re: Wide_String, Chinese & Japanese text files
  1999-08-21  0:00 ` Robert Dewar
@ 1999-08-21  0:00   ` Thierry Lelegard
  1999-08-22  0:00     ` Florian Weimer
  1999-08-22  0:00     ` Robert Dewar
  0 siblings, 2 replies; 6+ messages in thread
From: Thierry Lelegard @ 1999-08-21  0:00 UTC (permalink / raw)


Hello Mr Dewar,

> You really need to familiarize yourself with the relevant
> ISO standard and with Unicode. Yes, of course 16 bits
> are required (in fact, for full Chinese support, 32 bits
> are required; Unicode supports only a subset of Chinese).

Yes, I know I need to familiarize myself with this. That is
precisely why I posted this note: in order to get some
information or pointers to this information.

Does anyone have some pointers to these standards and to some
simple free text utilities which can create a few sample
text files with a US or European keyboard on UNIX (for
test purposes, not production of course)?

> In the case of GNAT, especially for a Japanese environment we
> can give a lot of help to GNAT Professional users, since our

Concerning the GNAT library, I will continue this topic
through the commercial channel with ACT.

-Thierry
________________________________________________________
Thierry Lelegard, Paris, France
E-mail: lelegard@club-internet.fr







* Re: Wide_String, Chinese & Japanese text files
  1999-08-21  0:00   ` Thierry Lelegard
@ 1999-08-22  0:00     ` Florian Weimer
  1999-08-25  0:00       ` Georg Bauhaus
  1999-08-22  0:00     ` Robert Dewar
  1 sibling, 1 reply; 6+ messages in thread
From: Florian Weimer @ 1999-08-22  0:00 UTC (permalink / raw)

"Thierry Lelegard" <lelegard@club-internet.fr> writes:

> Does anyone have some pointers to these standards and to some
> simple free text utilities which can create a few sample
> text files with a US or European keyboard on UNIX (for
> test purposes, not production of course)?

Emacs 20.4 plus the intlfonts-1.1 package.  It is completely free
(well, GPL), and you can (at least in theory) edit quite a few of
those strange languages with it.  Japanese, Chinese (both simplified
and traditional), Hindi, and even French or German, for example.
Major drawbacks: no Unicode, no right-to-left writing.  (If you intend
to use XEmacs instead: I'd recommend against it. MULE support in recent
versions seems to be a bit limited due to obvious lack of testing.)

Another possibility is yudit (GPL, too -- sorry, I don't know where I got
it from, but I can look it up if you are interested).  It does support
Unicode (and several encodings of it) and quite a few languages as well.
Major drawbacks: it works best with Bitstream's CyberBit TrueType Unicode
font, which once was freely available from Bitstream, but this offer
doesn't seem to exist anymore, and there's no visual feedback during the
composition of characters (which is especially helpful to beginners).
In addition, the choice of input methods seems to be rather limited
in comparison to Emacs (the X input method extension might cure that,
but I didn't test it at all).

Of course, I can't confirm that either of these tools is suitable for
production use.  (I'm already glad if someone understands my clumsy
English. ;)  In fact, I doubt it.  Nevertheless, you should be able to
create suitable sample text files using both programs together. (And you
can always hope for spam from Asia -- I'm getting a lot of it these
days. :-/)


* Re: Wide_String, Chinese & Japanese text files
  1999-08-21  0:00   ` Thierry Lelegard
  1999-08-22  0:00     ` Florian Weimer
@ 1999-08-22  0:00     ` Robert Dewar
  1 sibling, 0 replies; 6+ messages in thread
From: Robert Dewar @ 1999-08-22  0:00 UTC (permalink / raw)

In article <7pmitf$r71$1@front3.grolier.fr>,
  "Thierry Lelegard" <lelegard@club-internet.fr> wrote:

> Does anyone have some pointers to these standards and to some
> simple free text utilities which can create a few sample
> text files with a US or European keyboard on UNIX (for
> test purposes, not production of course)?

One thing to realize here is that, unlike the typical situation
with 8-bit codes, there are two separate things to worry about:

1. The encoding of each character into its 16-bit value

2. The manner in which 16-bit values are encoded, typically
into a stream of 8-bit bytes.

Ada has a lot to say about 1, but nothing at all to say about
2. In particular you cannot look at an encoding standard that
gives the 16-bit codes and then ask for a sample text file,
because a sample text file is about 2. rather than 1.
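
The two layers can be sketched as follows. This is a Python
illustration under stated assumptions, not Ada and not anything
GNAT does internally. The function names are made up for the
example, and the raw big-endian and little-endian layouts shown
are only two of the many possible answers to question 2.

```python
# Sketch only: layer 1 gives 16-bit values, layer 2 turns them into bytes.

def code_values(text):
    """Layer 1: each character as its 16-bit value (16-bit range only)."""
    values = [ord(ch) for ch in text]
    if any(v > 0xFFFF for v in values):
        raise ValueError("character outside the 16-bit range")
    return values


def to_bytes(values, byteorder):
    """Layer 2: write each 16-bit value as two 8-bit bytes."""
    return b"".join(v.to_bytes(2, byteorder) for v in values)


if __name__ == "__main__":
    values = code_values("\u65e5\u672c")
    print([hex(v) for v in values])          # same values in every case
    print(to_bytes(values, "big").hex())     # high-order byte first
    print(to_bytes(values, "little").hex())  # low-order byte first
```

The list of 16-bit values is identical in both cases. Only the
bytes in the file differ, and a reader has to be told which layout
was used.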



* Re: Wide_String, Chinese & Japanese text files
  1999-08-22  0:00     ` Florian Weimer
@ 1999-08-25  0:00       ` Georg Bauhaus
  0 siblings, 0 replies; 6+ messages in thread
From: Georg Bauhaus @ 1999-08-25  0:00 UTC (permalink / raw)


Florian Weimer (fw@s.netic.de) wrote:
: "Thierry Lelegard" <lelegard@club-internet.fr> writes:

: > Does anyone have some pointers to these standards and to some
: > simple free text utilities which can create a few sample
: > text files with a US or European keyboard on UNIX (for
: > test purposes, not production of course)?

Rob Pike's editor sam writes UTF-8 text files; there is also a Windows
version that comes with a sample document showing three or four languages
from around the world (look for sam.exe).

The utility tcs converts text from one encoding to another:
ftp://plan9.att.com/plan9/unixsrc/tcs.shar.Z

You can find the UNIX versions at least on the Debian GNU sites.

-# Georg




