From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: Supporting full Unicode
Date: Wed, 12 May 2004 14:53:23 GMT
Message-ID: <DTqoc.92307$dP1.289702@newsc.telia.net>
In-Reply-To: <2004512-125725-433248@foorum.com>
Ludovic Brenta wrote:
> Bjorn Persson wrote:
>
>>David Starner wrote:
>>
>>>they should have defined Wide_Character to be UTF-16 like Java did.
>>
>>Keeping in mind that in UTF-16 some characters take two bytes and
>>others take four, how do you propose to define that type?
[...]
> But UTF-8 is gaining momentum. Originally intended as an external
> encoding only, it is now in use as an internal encoding, too.
[...]
I'm not trying to stop anyone from using whatever encoding they like
internally. Just make sure you always know which encoding you have.
I just asked how David wanted to define a UTF-16 *character* type. A
UTF-16 *string* can be represented as an array of 16-bit elements, but
those elements aren't characters. In some cases an element is just half
a character. Only the fixed-width encodings can be easily represented as
arrays of characters.
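
To make the element-versus-character distinction concrete, here is a minimal sketch in Ada. The type UTF_16_String, the function Character_Count and the sample data are hypothetical names introduced only for this illustration, and the function assumes well-formed input (every high surrogate is followed by a low surrogate); it does no validation.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Count_UTF_16 is

   --  Hypothetical type for this sketch: one array element is one
   --  16-bit code unit, not necessarily one character.
   type UTF_16_String is array (Positive range <>) of Unsigned_16;

   --  Counts characters, assuming Item is well-formed UTF-16.
   function Character_Count (Item : UTF_16_String) return Natural is
      Count : Natural := 0;
   begin
      for I in Item'Range loop
         --  A low surrogate (DC00 .. DFFF) is the second half of a
         --  character already counted at its high surrogate.
         if Item (I) not in 16#DC00# .. 16#DFFF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Character_Count;

   --  "A", then U+1D11E (encoded as D834 DD1E), then "B"
   Sample : constant UTF_16_String :=
     (16#0041#, 16#D834#, 16#DD1E#, 16#0042#);

begin
   Ada.Text_IO.Put_Line
     ("Elements:  " & Natural'Image (Sample'Length));
   Ada.Text_IO.Put_Line
     ("Characters:" & Natural'Image (Character_Count (Sample)));
end Count_UTF_16;
```

The sample array has four elements but holds only three characters, so indexing it by element position does not index it by character.
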
Here's a datatype that can represent all UTF-16 characters and doesn't
accept illegal characters:
   --  The variants named One_Byte_* hold one 16-bit value;
   --  Two_Byte holds two (a surrogate pair).
   type Subrange is (One_Byte_Below_Hole,
                     One_Byte_Above_Hole,
                     Two_Byte);

   --  The "hole" is the surrogate range 16#D800# .. 16#DFFF#.
   type Code_Point_Below_Hole     is range 0 .. 16#D7FF#;
   type High_Surrogate_Code_Point is range 16#D800# .. 16#DBFF#;
   type Low_Surrogate_Code_Point  is range 16#DC00# .. 16#DFFF#;
   type Code_Point_Above_Hole     is range 16#E000# .. 16#FFFD#;

   type UTF_16_Character (Block : Subrange) is record
      case Block is
         when One_Byte_Below_Hole =>
            Value_Below_Hole : Code_Point_Below_Hole;
         when One_Byte_Above_Hole =>
            Value_Above_Hole : Code_Point_Above_Hole;
         when Two_Byte =>
            High_Surrogate_Value : High_Surrogate_Code_Point;
            Low_Surrogate_Value  : Low_Surrogate_Code_Point;
      end case;
   end record;
Looks troublesome, eh? For UTF-8 I don't think it's even possible to
define such a type. I'd rather just define UTF-16 and UTF-8 strings as
byte sequences and represent even single characters as strings.
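
As a rough sketch of the "single characters as strings" approach, the following Ada program treats a UTF-8 string as a plain byte array and returns its first character as a short string of the same type. UTF_8_String and First_Character are hypothetical names made up for this example, and the function assumes non-empty, well-formed UTF-8 without checking it.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure First_Character_Demo is

   --  Hypothetical type for this sketch: a UTF-8 string is just bytes.
   type UTF_8_String is array (Positive range <>) of Unsigned_8;

   --  Returns the first character of Item as a one-character string.
   --  Assumes Item is non-empty and well-formed; no validation here.
   function First_Character (Item : UTF_8_String) return UTF_8_String is
      Lead   : constant Unsigned_8 := Item (Item'First);
      Length : Positive;
   begin
      --  The lead byte determines how many bytes the character takes.
      if Lead < 16#80# then
         Length := 1;      --  0xxxxxxx
      elsif Lead < 16#E0# then
         Length := 2;      --  110xxxxx
      elsif Lead < 16#F0# then
         Length := 3;      --  1110xxxx
      else
         Length := 4;      --  11110xxx
      end if;
      return Item (Item'First .. Item'First + Length - 1);
   end First_Character;

   --  U+00F6 (encoded as C3 B6) followed by "r"
   Sample : constant UTF_8_String := (16#C3#, 16#B6#, 16#72#);
   First  : constant UTF_8_String := First_Character (Sample);

begin
   Ada.Text_IO.Put_Line
     ("Bytes in first character:" & Natural'Image (First'Length));
end First_Character_Demo;
```

Here the first character comes back as a two-byte string. No separate character type is needed, at the cost of the compiler no longer rejecting ill-formed sequences for you.
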
So go ahead and use UTF-8 in your programs, or Shift-JIS or EBCDIC for
all I care, but think twice before you define datatypes for
variable-width *characters*.
--
Björn Persson
jor ers @sv ge.
b n_p son eri nu