Re: Supporting full Unicode

comp.lang.ada

From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: Supporting full Unicode
Date: Wed, 12 May 2004 14:53:23 GMT
Message-ID: <DTqoc.92307$dP1.289702@newsc.telia.net>
In-Reply-To: <2004512-125725-433248@foorum.com>

Ludovic Brenta wrote:

> Bjorn Persson wrote:
> 
>>David Starner wrote:
>>
>>>they should have defined Wide_Character to be UTF-16 like Java did.
>>
>>Keeping in mind that in UTF-16 some characters take two bytes and
>>others take four, how do you propose to define that type?

[...]

> But UTF-8 is gaining momentum.  Originally intended as an external
> encoding only, it is now in use as an internal encoding, too.

[...]

I'm not trying to stop anyone from using whatever encoding they like 
internally. Just make sure you always know which encoding you have.

I just asked how David wanted to define a UTF-16 *character* type. A 
UTF-16 *string* can be represented as an array of 16-bit elements, but 
those elements aren't characters. In some cases an element is just half 
a character. Only the fixed-width encodings can be easily represented as 
arrays of characters.
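
To make the "half a character" point concrete, here is a minimal sketch. The names Code_Unit and UTF_16_String are made up for this illustration and do not come from any standard library. It stores "A" followed by U+1D11E (musical symbol G clef), which needs a surrogate pair, so three array elements hold only two characters:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure UTF_16_Units is
   --  Hypothetical names: a UTF-16 string as an array of 16-bit code units.
   type Code_Unit is new Unsigned_16;
   type UTF_16_String is array (Positive range <>) of Code_Unit;

   --  U+0041, then U+1D11E encoded as the surrogate pair D834 DD1E.
   Text : constant UTF_16_String := (16#0041#, 16#D834#, 16#DD1E#);

   Characters : Natural := 0;
begin
   for I in Text'Range loop
      --  A low surrogate is the second half of a character,
      --  so count every element except those.
      --  Assumes well-formed input; nothing is validated here.
      if Text (I) not in 16#DC00# .. 16#DFFF# then
         Characters := Characters + 1;
      end if;
   end loop;

   Put_Line (Natural'Image (Text'Length) & " code units,"
             & Natural'Image (Characters) & " characters");
end UTF_16_Units;
```

Indexing Text (2) or Text (3) on its own yields a surrogate, not a character, which is why the element type cannot honestly be called a character type.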

Here's a datatype that can represent all UTF-16 characters and doesn't 
accept illegal characters:

-- Note: "One_Byte" and "Two_Byte" count 16-bit code units, not octets.
-- The "hole" is the surrogate range 16#D800# .. 16#DFFF#.
type Subrange is (One_Byte_Below_Hole,
                   One_Byte_Above_Hole,
                   Two_Byte);
type Code_Point_Below_Hole is range 0 .. 16#D7FF#;
type High_Surrogate_Code_Point is range 16#D800# .. 16#DBFF#;
type Low_Surrogate_Code_Point is range 16#DC00# .. 16#DFFF#;
-- Stops at FFFD, so FFFE and FFFF are rejected.
type Code_Point_Above_Hole is range 16#E000# .. 16#FFFD#;
-- The discriminant selects which fields exist, so a lone surrogate
-- or a mismatched pair cannot be stored.
type UTF_16_Character(Block : Subrange) is record
    case Block is
       when One_Byte_Below_Hole =>
          Value_Below_Hole : Code_Point_Below_Hole;
       when One_Byte_Above_Hole =>
          Value_Above_Hole : Code_Point_Above_Hole;
       when Two_Byte =>
          High_Surrogate_Value : High_Surrogate_Code_Point;
          Low_Surrogate_Value : Low_Surrogate_Code_Point;
    end case;
end record;

Looks troublesome, eh? For UTF-8 I don't think it's even possible to 
define such a type. I'd rather just define UTF-16 and UTF-8 strings as 
byte sequences and represent even single characters as strings.
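
A rough sketch of what "single characters as strings" could look like for UTF-16 follows. Code_Unit, UTF_16_String and Unit_Count are names invented for this example, the elements are 16-bit units rather than octets, and the input is assumed to be well-formed (no validation is attempted):

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure Characters_As_Strings is
   --  Hypothetical types, not from any standard library.
   type Code_Unit is new Unsigned_16;
   type UTF_16_String is array (Positive range <>) of Code_Unit;

   --  Number of code units used by the character starting at Text (From).
   --  Assumption: a high surrogate is always followed by a low surrogate.
   function Unit_Count
     (Text : UTF_16_String; From : Positive) return Positive is
   begin
      if Text (From) in 16#D800# .. 16#DBFF# then
         return 2;
      else
         return 1;
      end if;
   end Unit_Count;

   --  U+0041, U+1D11E (as D834 DD1E), U+00E5.
   Text : constant UTF_16_String :=
     (16#0041#, 16#D834#, 16#DD1E#, 16#00E5#);
   Pos  : Positive := Text'First;
begin
   while Pos <= Text'Last loop
      declare
         --  Each character is itself a short string (a slice of Text).
         Char : constant UTF_16_String :=
           Text (Pos .. Pos + Unit_Count (Text, Pos) - 1);
      begin
         Put_Line ("character of" & Natural'Image (Char'Length)
                   & " unit(s)");
         Pos := Pos + Char'Length;
      end;
   end loop;
end Characters_As_Strings;
```

No separate character type is needed: a character is just a slice of length one or two. A truncated pair at the end of the input would raise Constraint_Error in the slice, so real code would have to validate first.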

So go ahead and use UTF-8 in your programs, or Shift-JIS or EBCDIC for 
all I care, but think twice before you define datatypes for 
variable-width *characters*.

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu
