Re: Character Sets - David Starner

comp.lang.ada

From: starner@okstate.edu (David Starner)
Subject: Re: Character Sets
Date: 13 Dec 2002 19:27:24 -0800
Message-ID: <81f70ac6.0212131927.4fa6b642@posting.google.com>
In-Reply-To: mailman.1038963002.11173.comp.lang.ada@ada.eu.org

> There is also an AI in the works, having something to do with 32-bit
> characters.  I don't remember the AI number.

In response to AI-00285:

Why is Latin-9's introduction such a big deal? Latin-1 is still the
"standard" 8-bit character set, and so immortalized in HTML and
other places. Latin-9 is just another character set, no more
important than any other 8-bit set. Sure, people in Western
Europe are using it, but I bet more people still use Latin-1
than Latin-9, and more people probably use KOI8-R than Latin-9.
There are many character sets out there; adding support for just
one more doesn't help things, especially as anyone writing for
international systems needs at the very least to set the character
set at startup rather than at compile time.
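
To make the startup-versus-compile-time point concrete, here is a
minimal Python sketch. The names `startup_charset`, `decode_input`, and
the `APP_CHARSET` setting are hypothetical, invented only for this
illustration; the codec names are the ones Python's standard library
uses. It shows one byte decoding to different characters depending on a
character set picked at run time.

```python
def startup_charset(environ: dict) -> str:
    """Pick the character set once at startup (APP_CHARSET is a made-up setting)."""
    return environ.get("APP_CHARSET", "latin-1")


def decode_input(raw: bytes, charset: str) -> str:
    """Decode raw bytes with a character set chosen at run time."""
    return raw.decode(charset)


# The same byte is a different character in Latin-1 and Latin-9.
as_latin1 = decode_input(b"\xa4", "latin-1")     # U+00A4 CURRENCY SIGN
as_latin9 = decode_input(b"\xa4", "iso8859-15")  # U+20AC EURO SIGN

# A Cyrillic user would configure KOI8-R instead; no recompilation needed.
as_koi8r = decode_input(b"\xc1", "koi8-r")

# Simulated configuration for one user.
chosen = startup_charset({"APP_CHARSET": "iso8859-15"})
price = decode_input(b"10 \xa4", chosen)
print(repr(as_latin1), repr(as_latin9), repr(as_koi8r), repr(price))
```

The design choice being illustrated is only that the encoding is data,
not a build option, so one binary can serve users with different
encodings.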

From: Pascal Leroy
> I still think
> that we want to retain the capacity of using 16-bit blobs to represent
> characters in the BMP, as 99.5% of practical applications will only need the
> BMP.

I sort of feel like this is saying that 99.5% of practical
applications will never need a "q". For any program that handles text,
there shouldn't be arbitrary restrictions on what comes in and out; a
program that handles Unicode should handle Unicode, instead of the
subset the programmer thought people would use. That's half the use of
Unicode: being able to use Latin letter Kra, and knowing that you
aren't limited to the systems that handle ISO-6937, or Ogham and
NSAI-434.

> Anyway, I don't think it is reasonable to force applications to go to the
> full 32-bit overhead just because they use, say, the french OE ligature.

Applications don't use the French OE ligature; users do. And
arbitrarily limiting users does not make your system a pleasure to
use.

In any case, how much overhead are we talking? In worst case
scenarios, we're talking a doubling of the memory the program uses.
But embedded systems are rarely heavy text users, and can probably
stay with Latin-1. I don't work with text files much larger than a
megabyte, and don't know of anyone who does. And if you're working
with large amounts of data and need to reduce size, compression, both
standard (e.g. LZW) and Unicode-specific (e.g. SCSU or BOCU-1), works
better than just using 16 bits.
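
The size argument can be checked with a rough Python sketch. The sample
text is my own and deliberately repetitive, so the compressed figure is
optimistic. DEFLATE (via `zlib`) stands in for LZW here, since neither
LZW nor SCSU/BOCU-1 ships in Python's standard library.

```python
import zlib

# Made-up, repetitive sample text containing the OE ligature (U+0153).
text = "Le c\u0153ur a ses raisons. " * 500

utf16 = text.encode("utf-16-le")  # 2 bytes per character (all BMP here)
utf32 = text.encode("utf-32-le")  # 4 bytes per character, always

# General-purpose compression applied to the 32-bit form.
packed = zlib.compress(utf32)

print("characters:      ", len(text))
print("UTF-16 bytes:    ", len(utf16))
print("UTF-32 bytes:    ", len(utf32))
print("UTF-32 + DEFLATE:", len(packed))
```

For BMP-only text the 32-bit form is exactly twice the 16-bit form,
which is the worst-case doubling mentioned above; on text like this
sample, the compressed 32-bit form comes out smaller than the raw
16-bit form.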

> We certainly don't want to get into that business.  The designers of Ada 95
> wisely decided to lump all of the characters in the range 16#0100# ..
> 16#FFFD# into the category special_character, so that they don't have to
> decide which is a letter, a number, etc.  Similarly they didn't provide
> classification functions or upper/lower conversions for wide characters.

So it's left for a dozen implementations to do.

> This seems reasonable if we don't want to have to amend Ada each time a
> bunch of characters are added to 10646.

Why would you have to amend Ada? Add a Unicode version constant, and
define the data in terms of its Unicode properties. Then the
recentness of the characters is just a quality of implementation
issue.
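
As an illustration of that approach in an existing language (not a
proposal for Ada syntax), Python's standard `unicodedata` module exposes
a Unicode version constant and defines classification in terms of the
Unicode general category. The helper names `is_letter` and
`is_decimal_digit` below are hypothetical.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this Python build.
# Newer builds report a newer version; the language definition is unchanged.
print("Unicode data version:", unicodedata.unidata_version)


def is_letter(ch: str) -> bool:
    """True if the general category is one of the letter categories (Lu, Ll, ...)."""
    return unicodedata.category(ch).startswith("L")


def is_decimal_digit(ch: str) -> bool:
    """True if the general category is Nd (decimal digit)."""
    return unicodedata.category(ch) == "Nd"


oe_upper = "\u0152"      # LATIN CAPITAL LIGATURE OE
kra = "\u0138"           # LATIN SMALL LETTER KRA
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(is_letter(oe_upper), is_letter(kra), is_decimal_digit(arabic_three))
print(oe_upper.lower() == "\u0153")  # case mapping also comes from the database
```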

From: Robert Dewar
> We certainly
> put in a lot of work in GNAT in implementing wide character with many
> different representation schemes,

GNAT supports input files in a dozen mostly bizarre or archaic
formats. It doesn't strike me as very useful, especially considering
that it supports Latin-1 and Latin-2 (both useful), but also Latin-4
(completely unused) and Latin-3 (good for Maltese and Esperanto, and
most Esperanto users don't use it). It doesn't support ISO-8859-5 or
KOI8-R (Russian), or ISO-8859-7 (Greek). It doesn't support changing
formats on the fly - many users have multiple encodings around,
besides the fact that having to compile a different binary for each
user is a pain. Oh, and last time I submitted a bug on it, it got
ignored, until I brought it up on the gcc list, when it was pointed
out that the feature I was using (style checking on source files)
wasn't supported with UTF-8.

From: Pascal Leroy
> Remember, we are talking Ada applications here.  There are probably many
> applications out there that deal with mathematical symbols or with Tengwar, 
> but I doubt that they are written in Ada.

Mathematical symbols and Tengwar are text. Any text handling system
that supports Unicode should handle them like any other text, because
sooner or later users will expect it to handle them. (If you're
unlucky, it will be the day that you're showing your system off in
Hong Kong, and the potential buyer decides to put in his name that
isn't in the BMP.) If people don't want Ada to be a general-purpose
programming language, then that's fine; but it's not acceptable for a
general-purpose programming language not to be able to handle text,
and for a modern language, that means Unicode.
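
For anyone unsure what "isn't in the BMP" costs a 16-bit character
type, here is a small Python sketch of the standard UTF-16 surrogate
computation. The function name `utf16_units` is hypothetical, and
U+20000 (the first code point of CJK Extension B) is chosen only as an
example of a non-BMP ideograph.

```python
def utf16_units(ch: str) -> list:
    """Return the UTF-16 code units (as ints) for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        # BMP character: fits in a single 16-bit unit.
        return [cp]
    # Supplementary character: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


oe = "\u0153"            # OE ligature, inside the BMP
ext_b = "\U00020000"     # ideograph outside the BMP

print([hex(u) for u in utf16_units(oe)])
print([hex(u) for u in utf16_units(ext_b)])
```

A type that holds exactly one 16-bit unit per character can store the
first but not the second, which is the restriction being objected to.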

