From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII
X-Google-Thread: 103376,bcb6f63419c2a56b
X-Google-Attributes: gid103376,public
Path: 
 controlnews3.google.com!news1.google.com!newshub.sdsu.edu!newshosting.com!nx01.iad01.newshosting.com!newsfeed.icl.net!newsfeed.fjserv.net!newsfeed.wirehub.nl!news.tele.dk!news.tele.dk!small.news.tele.dk!news-stoc.telia.net!news-stoa.telia.net!telia.net!masternews.telia.net.!newsc.telia.net.POSTED!not-for-mail
From: =?ISO-8859-1?Q?Bj=F6rn_Persson?= <spam-away@nowhere.nil>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.1) Gecko/20031114
X-Accept-Language: sv, en-us
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: Supporting full Unicode
References: <9j8oc.16324$V97.13312@newsread1.news.pas.earthlink.net>
 <2004512-94456-948110@foorum.com> <pan.2004.05.12.09.26.57.126499@email.ro>
 <dQmoc.58891$mU6.238072@newsb.telia.net> <2004512-125725-433248@foorum.com>
In-Reply-To: <2004512-125725-433248@foorum.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: quoted-printable
Message-ID: <DTqoc.92307$dP1.289702@newsc.telia.net>
Date: Wed, 12 May 2004 14:53:23 GMT
NNTP-Posting-Host: 217.209.116.179
X-Complaints-To: abuse@telia.com
X-Trace: newsc.telia.net 1084373603 217.209.116.179 (Wed,
 12 May 2004 16:53:23 CEST)
NNTP-Posting-Date: Wed, 12 May 2004 16:53:23 CEST
Organization: Telia Internet
Xref: controlnews3.google.com comp.lang.ada:492
List-Id: <comp.lang.ada>

Ludovic Brenta wrote:

> Bjorn Persson wrote:
>
>>David Starner wrote:
>>
>>>they should have defined Wide_Character to be UTF-16 like Java did.
>>
>>Keeping in mind that in UTF-16 some characters take two bytes and
>>others take four, how do you propose to define that type?

[...]

> But UTF-8 is gaining momentum.  Originally intended as an external
> encoding only, it is now in use as an internal encoding, too.

[...]

I'm not trying to stop anyone from using whatever encoding they like
internally. Just make sure you always know which encoding you have.

I just asked how David wanted to define a UTF-16 *character* type. A
UTF-16 *string* can be represented as an array of 16-bit elements, but
those elements aren't characters. In some cases an element is just half
a character. Only the fixed-width encodings can be easily represented as
arrays of characters.
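
To make the "half a character" point concrete, here is a minimal sketch that counts characters in an array of 16-bit elements. The names Code_Unit_Array, Is_High_Surrogate, Is_Low_Surrogate and Character_Count are made up for this illustration, and the sketch does no validation: an unpaired surrogate is simply counted as one element.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Count_UTF_16 is

   --  Hypothetical type: a UTF-16 string as an array of 16-bit elements.
   type Code_Unit_Array is array (Positive range <>) of Unsigned_16;

   function Is_High_Surrogate (U : Unsigned_16) return Boolean is
   begin
      return U in 16#D800# .. 16#DBFF#;
   end Is_High_Surrogate;

   function Is_Low_Surrogate (U : Unsigned_16) return Boolean is
   begin
      return U in 16#DC00# .. 16#DFFF#;
   end Is_Low_Surrogate;

   --  Counts characters, treating a high surrogate that is directly
   --  followed by a low surrogate as a single character.
   function Character_Count (S : Code_Unit_Array) return Natural is
      Count : Natural  := 0;
      I     : Positive := S'First;
   begin
      while I <= S'Last loop
         if Is_High_Surrogate (S (I))
           and then I < S'Last
           and then Is_Low_Surrogate (S (I + 1))
         then
            I := I + 2;  --  two elements, one character
         else
            I := I + 1;  --  one element, one character
         end if;
         Count := Count + 1;
      end loop;
      return Count;
   end Character_Count;

   --  "A" followed by U+1D11E, which UTF-16 encodes as D834 DD1E.
   Sample : constant Code_Unit_Array := (16#0041#, 16#D834#, 16#DD1E#);

begin
   Ada.Text_IO.Put_Line
     ("Elements:" & Natural'Image (Sample'Length));
   Ada.Text_IO.Put_Line
     ("Characters:" & Natural'Image (Character_Count (Sample)));
end Count_UTF_16;
```

The array has three elements but only two characters, so Sample (2) on its own is not a character at all.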

Here's a datatype that can represent all UTF-16 characters and doesn't
accept illegal characters:

--  "Unit" means one 16-bit element (two bytes). Characters outside
--  the surrogate hole take one unit; all others take a surrogate pair.
type Subrange is (One_Unit_Below_Hole,
                   One_Unit_Above_Hole,
                   Two_Unit);
type Code_Point_Below_Hole is range 0 .. 16#D7FF#;
type High_Surrogate_Code_Point is range 16#D800# .. 16#DBFF#;
type Low_Surrogate_Code_Point is range 16#DC00# .. 16#DFFF#;
type Code_Point_Above_Hole is range 16#E000# .. 16#FFFD#;
type UTF_16_Character(Block : Subrange) is record
    case Block is
       when One_Unit_Below_Hole =>
          Value_Below_Hole : Code_Point_Below_Hole;
       when One_Unit_Above_Hole =>
          Value_Above_Hole : Code_Point_Above_Hole;
       when Two_Unit =>
          High_Surrogate_Value : High_Surrogate_Code_Point;
          Low_Surrogate_Value : Low_Surrogate_Code_Point;
    end case;
end record;

Looks troublesome, eh? For UTF-8 I don't think it's even possible to
define such a type. I'd rather just define UTF-16 and UTF-8 strings as
byte sequences and represent even single characters as strings.
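
As a rough sketch of what "single characters as strings" could look like for UTF-8, the function below turns one code point into a String of one to four bytes. Code_Point, UTF_8_String and To_UTF_8 are names invented for this example, not taken from any library, and using String as the byte container is just one possible choice.

```ada
with Ada.Text_IO;

procedure Encode_UTF_8 is

   --  Hypothetical types for this illustration only.
   type Code_Point is range 0 .. 16#10FFFF#;
   subtype UTF_8_String is String;  --  each element is a byte, not a character

   --  Returns the UTF-8 byte sequence for a single code point.
   --  Surrogate code points are rejected because they are not characters.
   function To_UTF_8 (C : Code_Point) return UTF_8_String is
   begin
      if C in 16#D800# .. 16#DFFF# then
         raise Constraint_Error;
      end if;

      if C < 16#80# then
         return (1 => Character'Val (C));
      elsif C < 16#800# then
         return (Character'Val (16#C0# + C / 16#40#),
                 Character'Val (16#80# + C mod 16#40#));
      elsif C < 16#1_0000# then
         return (Character'Val (16#E0# + C / 16#1000#),
                 Character'Val (16#80# + (C / 16#40#) mod 16#40#),
                 Character'Val (16#80# + C mod 16#40#));
      else
         return (Character'Val (16#F0# + C / 16#4_0000#),
                 Character'Val (16#80# + (C / 16#1000#) mod 16#40#),
                 Character'Val (16#80# + (C / 16#40#) mod 16#40#),
                 Character'Val (16#80# + C mod 16#40#));
      end if;
   end To_UTF_8;

   --  U+00F6 (o with diaeresis) encodes as the two bytes C3 B6.
   O_Umlaut : constant UTF_8_String := To_UTF_8 (16#F6#);

begin
   pragma Assert (O_Umlaut'Length = 2);
   pragma Assert (Character'Pos (O_Umlaut (1)) = 16#C3#);
   pragma Assert (Character'Pos (O_Umlaut (2)) = 16#B6#);
   Ada.Text_IO.Put_Line
     ("U+00F6 takes" & Natural'Image (O_Umlaut'Length) & " bytes");
end Encode_UTF_8;
```

The result is still a string even though it holds exactly one character, which sidesteps the need for a variable-width character type.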

So go ahead and use UTF-8 in your programs, or Shift-JIS or EBCDIC for
all I care, but think twice before you define datatypes for
variable-width *characters*.

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu