AI-285 - Comment from Unicode list

comp.lang.ada

* AI-285 - Comment from Unicode list
@ 2004-02-15  0:25 David Starner
  2004-02-15 14:45 ` UTF-8 (was: AI-285 - Comment from Unicode list) Wes Groleau
  0 siblings, 1 reply; 8+ messages in thread
From: David Starner @ 2004-02-15  0:25 UTC

Markus Scherer writes (at
<http://www.unicode.org/mail-arch/unicode-ml/y2004-m01/0508.html>):

***
D. Starner wrote:
>> #12 UTF-16 for Processing
>
> This is incorrect in saying that Ada uses UTF-16. It supports UCS-2
> only. The text of the standard says:
>
> The predefined type Wide_Character is a character type whose values
> correspond to the 65536 code positions of the ISO 10646 Basic
> Multilingual Plane (BMP). [...]
>
> which doesn't include surrogate code points. The next

True, but not much different/worse than for Java, for example. Once you have 16-bit types and string
literals, adding a few functions to deal with supplementary code points is not hard. We did this for
Java in ICU4J.

There is little difference for a language between supporting UCS-2 or UTF-16 because where functions
do not handle supplementary code points, they usually also don't handle Unicode versions above 3.0 -
so string case mappings etc. are the same.

A language like that can be relatively easily upgraded to full UTF-16 handling by updating the
character and string function implementations, and adding a few new APIs - that is what Java is
doing. The upgrade is done naturally when the standard functions are extended to Unicode 3.1 or later.

As such, whether the strings contain UCS-2 or UTF-16 depends less on the language definition and
more on the functions that are used, and the version of the standard libraries.
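
To make the point about "a few functions" concrete, here is a minimal sketch in Python (not Ada, and not the ICU4J API) of the kind of helper involved: it walks a sequence of 16-bit code units and combines surrogate pairs into supplementary code points. The name `code_points` is made up for this illustration.

```python
def code_points(units):
    """Yield Unicode code points from a sequence of 16-bit code units.

    Hypothetical helper for illustration only. A high surrogate followed
    by a low surrogate is combined into one supplementary code point; any
    other unit (including an unpaired surrogate) is passed through as-is.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Each surrogate carries 10 bits of the value above 0x10000.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


# 'A' followed by U+10330 (GOTHIC LETTER AHSA) stored as a surrogate pair.
units = [0x0041, 0xD800, 0xDF30]
decoded = list(code_points(units))
print(len(units), "code units,", len(decoded), "code points")
print([f"U+{cp:04X}" for cp in decoded])
```

Code that treats the same array as UCS-2 would simply see three 16-bit values, which is why the difference lies mostly in the library functions rather than in the storage type.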

> version of Ada will have 32-bit characters to fully
> support Unicode - the text of the proposal is here:
>
> <http://www.ada-auth.org/cgi-bin/cvsweb.cgi/AIs/AI-00285.TXT?rev=1.14>
>
> plus lengthy discussion on the issues.

Thank you very much for the link.

The proposal seems to be to continue to treat Wide strings as UCS-2, and to treat Wide_Wide strings
(a new type) as UTF-32. This would give Ada a total of three different native string types on the
language level. It would also mean that existing code, using 16-bit strings, would not benefit from
an upgrade but would instead have to be rewritten for support of supplementary code points. This may
in fact slow down such support.

There will be a presentation of the choices for Java (including UTF-32) at IUC 25.

Best regards,
markus

***

* UTF-8 (was: AI-285 - Comment from Unicode list)
  2004-02-15  0:25 AI-285 - Comment from Unicode list David Starner
@ 2004-02-15 14:45 ` Wes Groleau
  2004-02-15 22:31   ` David Starner
  2004-02-17 13:39   ` UTF-8 Georg Bauhaus
  0 siblings, 2 replies; 8+ messages in thread
From: Wes Groleau @ 2004-02-15 14:45 UTC



I'd like to see a package (or built-in) to support UTF-8.
But that's just me.  I do a little bit of Polish and Japanese
and might do a little Burmese, so I need Unicode.  But since
I'm mostly English and Spanish and French, if I used UTF-16
my files would be 49.x% zero bytes.
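
The estimate is easy to check. For characters below U+0100 (which covers English and most Spanish and French text), every UTF-16 code unit has a zero high byte, so half the bytes are zero; a byte order mark or the occasional Polish or Japanese character pulls the figure slightly under 50%. A small Python sketch, with a made-up sample sentence and helper name:

```python
def zero_byte_fraction(text, encoding):
    """Return the fraction of bytes equal to 0x00 when text is encoded."""
    data = text.encode(encoding)
    return data.count(0) / len(data) if data else 0.0


# Illustrative sample: mostly ASCII with a few accented Latin-1 letters
# (n-tilde, o-acute, e-acute written as escapes).
sample = "El ni\u00f1o comi\u00f3 en el caf\u00e9."

# "utf-16-le" is used so that no byte order mark is added.
for enc in ("utf-8", "utf-16-le"):
    size = len(sample.encode(enc))
    share = zero_byte_fraction(sample, enc)
    print(f"{enc}: {size} bytes, {share:.1%} zero bytes")
```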

I have often been tempted to write such a package.
Has it already been done?

I admit it--I don't even know what UCS-2 is.  :-)



* Re: UTF-8 (was: AI-285 - Comment from Unicode list)
  2004-02-15 14:45 ` UTF-8 (was: AI-285 - Comment from Unicode list) Wes Groleau
@ 2004-02-15 22:31   ` David Starner
  2004-02-16 22:18     ` UTF-8 Wes Groleau
  2004-02-17 13:39   ` UTF-8 Georg Bauhaus
  1 sibling, 1 reply; 8+ messages in thread
From: David Starner @ 2004-02-15 22:31 UTC

On Sun, 15 Feb 2004 09:45:02 -0500, Wes Groleau wrote:
> I'd like to see a package (or built-in) to support UTF-8.
> But that's just me.  I do a little bit of Polish and Japanese
> and might do a little Burmese, so I need Unicode.  But since
> I'm mostly English and Spanish and French, if I used UTF-16
> my files would be 49.x% zero bytes.

But the internal character set has nothing to do with the external. We
could output UTF-8 and use UTF-16 or UTF-32 internally. In fact, if you
set the character set of the source code to UTF-8 with GNAT, it will input
and output UTF-8. (This is not a great design, IMO.)
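
A minimal sketch of that separation, in Python rather than Ada: the bytes on disk are UTF-8, the processing step works on decoded characters, and the result is encoded again on the way out. Python's `str` merely stands in for whatever internal representation a program picks, and `upper_utf8` is a hypothetical name.

```python
def upper_utf8(data: bytes) -> bytes:
    """Upper-case UTF-8 encoded text and return UTF-8 encoded text."""
    text = data.decode("utf-8")      # external form -> internal string
    result = text.upper()            # processing sees characters, not bytes
    return result.encode("utf-8")    # internal string -> external form


# A Polish word written with escapes: z, a, then four non-ASCII letters.
external = "za\u017c\u00f3\u0142\u0107".encode("utf-8")
print(len(external), "bytes in,", len(external.decode("utf-8")), "characters")
print(upper_utf8(external))
```

Only the two conversion calls know about UTF-8; swapping the file encoding would not touch the processing step.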

> I have often been tempted to write such a package. Has it already been
> done?

http://sourceforge.net/projects/ngeadal/ will do it, among a few other
Unicode-related things. I never really completed it, and it doesn't have
any sort of stream I/O (instead dumping files as a whole), but it should
work, and I'm willing to answer questions.

> I admit it--I don't even know what UCS-2 is.  :-)

Unicode is divided into 17 planes, 4 of which are used in any way. All
but one were empty until a couple of years ago. UCS-2 is like UTF-16, but
doesn't support the surrogate code points needed to reach planes other than
the first. That means that Gothic, Linear A, and Cuneiform (in the future)
won't be supported; but it also means that the mathematical alphanumerics
and Cantonese won't be supported, nor will a lot of older literary
Chinese, Japanese, Korean and Vietnamese, or other minor Chinese
languages.
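
To illustrate the difference, here is a short Python sketch (function names are hypothetical) that encodes a single code point as UTF-16 code units. Anything in the first plane takes one unit and is identical in UCS-2; anything beyond it needs a surrogate pair, which UCS-2 has no way to express.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (a list of ints) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                  # plane 0 (BMP): one unit, same as UCS-2
    v = cp - 0x10000                 # 20 bits, split into two 10-bit halves
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def plane(cp):
    """Return the plane number (0 to 16) of a code point."""
    return cp >> 16


examples = [
    ("LATIN SMALL LETTER A", 0x0061),
    ("GOTHIC LETTER AHSA", 0x10330),
    ("MATHEMATICAL BOLD CAPITAL A", 0x1D400),
]
for name, cp in examples:
    units = utf16_units(cp)
    hex_units = " ".join(f"{u:04X}" for u in units)
    fits = "yes" if len(units) == 1 else "no"
    print(f"U+{cp:04X} {name}: plane {plane(cp)}, "
          f"UTF-16 {hex_units}, fits UCS-2: {fits}")
```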

* Re: UTF-8
  2004-02-15 22:31   ` David Starner
@ 2004-02-16 22:18     ` Wes Groleau
  2004-02-17  2:05       ` UTF-8 David Starner
  0 siblings, 1 reply; 8+ messages in thread
From: Wes Groleau @ 2004-02-16 22:18 UTC


David Starner wrote:
> But the internal character set has nothing to do with the external. We
> could output UTF-8 and use UTF-16 or UTF-32 internally. In fact, if you
> set the character set of the source code to UTF-8 with GNAT, it will input
> and output UTF-8. (This is not a great design, IMO.)

To tell the truth, I'm not completely sure what
it is exactly that I want.  :-)  To read in UTF-8
and turn it into Wide_String internally?
To read it into String as-is?  Or .....

But I do know that I use UTF-8 to store my text files.



* Re: UTF-8
  2004-02-16 22:18     ` UTF-8 Wes Groleau
@ 2004-02-17  2:05       ` David Starner
  0 siblings, 0 replies; 8+ messages in thread
From: David Starner @ 2004-02-17  2:05 UTC

On Mon, 16 Feb 2004 17:18:09 -0500, Wes Groleau wrote:
> To tell the truth, I'm not completely sure what
> it is exactly that I want.  :-)  To read in UTF-8
> and turn it into Wide_String internally?
> To read it into String as-is?  Or .....

If you want to process the files, loading them into String means you're going
to have to write all the code to step to the next character and whatnot
yourself. Of course, UTF-8 is designed to make a lot of processing easy.
I've never really written a text handling program in Ada. The programs
I've written treat messages as opaque blocks to be handed to gettext, in
which case String is the way to go.
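
For a sense of what "stepping to the next character" involves when the text is kept as raw UTF-8 bytes, here is a minimal Python sketch (helper names are made up). It relies on the UTF-8 property that every continuation byte has the bit pattern 10xxxxxx, and it assumes well-formed input.

```python
def next_char(data, i):
    """Return the index just past the UTF-8 sequence starting at data[i].

    Assumes data is well-formed UTF-8 and i points at a lead byte.
    """
    i += 1
    # Skip continuation bytes (10xxxxxx) until the next lead byte or the end.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i


def char_starts(data):
    """Return the byte offsets at which each character begins."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        i = next_char(data, i)
    return starts


# One character each of 1, 2, 3 and 4 bytes in UTF-8.
raw = "a\u00e9\u65e5\U00010330".encode("utf-8")
print(len(raw), "bytes, characters start at", char_starts(raw))
```

Because lead bytes and continuation bytes never overlap, the same test also lets a program resynchronise from an arbitrary byte offset, which is part of what makes this kind of processing easy.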

* Re: UTF-8
  2004-02-15 14:45 ` UTF-8 (was: AI-285 - Comment from Unicode list) Wes Groleau
  2004-02-15 22:31   ` David Starner
@ 2004-02-17 13:39   ` Georg Bauhaus
  2004-02-18  2:39     ` UTF-8 Wes Groleau
  2004-02-18  2:40     ` UTF-8 Wes Groleau
  1 sibling, 2 replies; 8+ messages in thread
From: Georg Bauhaus @ 2004-02-17 13:39 UTC


Wes Groleau <groleau+news@freeshell.org> wrote:
: 
: I have often been tempted to write such a package.
: Has it already been done?

You can find code in GNAT and in XML/Ada.
(I have a function to turn Wide_String values into
UTF-8 coded String values for passing to Exception_Information.
I can send it via eMail if it is of use.)
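
The function mentioned above is not shown in this thread. As a rough idea of what such a conversion does, here is a Python sketch (hypothetical name, not the Ada code being offered) that encodes a sequence of 16-bit BMP values, as a Wide_String would hold, into UTF-8 bytes. It assumes UCS-2 input and therefore rejects surrogate values instead of pairing them.

```python
def ucs2_to_utf8(units):
    """Encode a sequence of 16-bit BMP code units as UTF-8 bytes."""
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"not a 16-bit value: {u:#x}")
        if 0xD800 <= u <= 0xDFFF:
            raise ValueError(f"surrogate {u:#06x} is not a UCS-2 character")
        if u < 0x80:
            out.append(u)                         # 0xxxxxxx
        elif u < 0x800:
            out.append(0xC0 | (u >> 6))           # 110xxxxx
            out.append(0x80 | (u & 0x3F))         # 10xxxxxx
        else:
            out.append(0xE0 | (u >> 12))          # 1110xxxx
            out.append(0x80 | ((u >> 6) & 0x3F))  # 10xxxxxx
            out.append(0x80 | (u & 0x3F))         # 10xxxxxx
    return bytes(out)


# ASCII, Latin-1, Latin Extended-A and CJK values, written as escapes.
wide = [ord(c) for c in "Za\u017c\u00f3\u0142\u0107 \u65e5\u672c"]
print(ucs2_to_utf8(wide))
```

A BMP character never needs more than three bytes in UTF-8, so the output is at most three times the number of input characters.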


-- Georg



* Re: UTF-8
  2004-02-17 13:39   ` UTF-8 Georg Bauhaus
@ 2004-02-18  2:39     ` Wes Groleau
  2004-02-18  2:40     ` UTF-8 Wes Groleau
  1 sibling, 0 replies; 8+ messages in thread
From: Wes Groleau @ 2004-02-18  2:39 UTC


Georg Bauhaus wrote:
> (I have a function to turn Wide_String values into
> UTF-8 coded String values for passing to Exception_Information.
> I can send it via eMail if it is of use.)

Sure, I can stash it away somewhere.  At the moment,
I'm only doing writing and editing with an editor,
but I may have to do some searching or other stuff
some time.



* Re: UTF-8
  2004-02-17 13:39   ` UTF-8 Georg Bauhaus
  2004-02-18  2:39     ` UTF-8 Wes Groleau
@ 2004-02-18  2:40     ` Wes Groleau
  1 sibling, 0 replies; 8+ messages in thread
From: Wes Groleau @ 2004-02-18  2:40 UTC


Georg Bauhaus wrote:
> I can send it via eMail if it is of use.)

oops, forgot: above address works, but

groleau+ada might be better (same host).



Thread overview: 8+ messages
2004-02-15  0:25 AI-285 - Comment from Unicode list David Starner
2004-02-15 14:45 ` UTF-8 (was: AI-285 - Comment from Unicode list) Wes Groleau
2004-02-15 22:31   ` David Starner
2004-02-16 22:18     ` UTF-8 Wes Groleau
2004-02-17  2:05       ` UTF-8 David Starner
2004-02-17 13:39   ` UTF-8 Georg Bauhaus
2004-02-18  2:39     ` UTF-8 Wes Groleau
2004-02-18  2:40     ` UTF-8 Wes Groleau
