From: Björn Persson
Newsgroups: comp.lang.ada
Subject: Re: Supporting full Unicode
Date: Wed, 12 May 2004 14:53:23 GMT

Ludovic Brenta wrote:
> Bjorn Persson wrote:
>
>>David Starner wrote:
>>
>>>they should have defined Wide_Character to be UTF-16 like Java did.
>>
>>Keeping in mind that in UTF-16 some characters take two bytes and
>>others take four, how do you propose to define that type?
[...]
> But UTF-8 is gaining momentum. Originally intended as an external
> encoding only, it is now in use as an internal encoding, too.
[...]

I'm not trying to stop anyone from using whatever encoding they like
internally. Just make sure you always know which encoding you have.

I just asked how David wanted to define a UTF-16 *character* type. A
UTF-16 *string* can be represented as an array of 16-bit elements, but
those elements aren't characters. In some cases an element is just half
a character. Only the fixed-width encodings can be easily represented as
arrays of characters.

Here's a datatype that can represent all UTF-16 characters and doesn't
accept illegal characters:

   --  "Byte" in these names means one 16-bit element.
   type Subrange is (One_Byte_Below_Hole, One_Byte_Above_Hole, Two_Byte);

   type Code_Point_Below_Hole is range 0 .. 16#D7FF#;
   type High_Surrogate_Code_Point is range 16#D800# .. 16#DBFF#;
   type Low_Surrogate_Code_Point is range 16#DC00# .. 16#DFFF#;
   type Code_Point_Above_Hole is range 16#E000# .. 16#FFFD#;

   type UTF_16_Character (Block : Subrange) is record
      case Block is
         when One_Byte_Below_Hole =>
            Value_Below_Hole : Code_Point_Below_Hole;
         when One_Byte_Above_Hole =>
            Value_Above_Hole : Code_Point_Above_Hole;
         when Two_Byte =>
            High_Surrogate_Value : High_Surrogate_Code_Point;
            Low_Surrogate_Value  : Low_Surrogate_Code_Point;
      end case;
   end record;

Looks troublesome, eh? For UTF-8 I don't think it's even possible to
define such a type. I'd rather just define UTF-16 and UTF-8 strings as
byte sequences and represent even single characters as strings.

So go ahead and use UTF-8 in your programs, or Shift-JIS or EBCDIC for
all I care, but think twice before you define datatypes for
variable-width *characters*.

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu
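
For comparison, here is a minimal sketch of the alternative suggested
above: a UTF-16 string as a plain array of 16-bit elements, with even a
single character held as a one- or two-element string. All the names
(Code_Unit, UTF_16_String, Code_Point, Encode, UTF_16_Demo) are made up
for illustration and don't come from any standard library. It is only
meant to show the shape of the idea, not a complete implementation.

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF_16_Demo is

   --  Hypothetical types: one 16-bit element, and a sequence of them.
   type Code_Unit is mod 2 ** 16;
   type UTF_16_String is array (Positive range <>) of Code_Unit;

   --  Code point values up to 16#10FFFF#. The surrogate hole is
   --  rejected at run time in Encode rather than by the subtype.
   subtype Code_Point is Unsigned_32 range 0 .. 16#10FFFF#;

   --  Encode one code point as a string of one or two elements.
   function Encode (C : Code_Point) return UTF_16_String is
   begin
      if C in 16#D800# .. 16#DFFF# then
         --  Surrogate values are not characters on their own.
         raise Constraint_Error;
      elsif C < 16#1_0000# then
         return (1 => Code_Unit (C));
      else
         declare
            Offset : constant Unsigned_32 := C - 16#1_0000#;
         begin
            --  Upper 10 bits go into the high surrogate,
            --  lower 10 bits into the low surrogate.
            return (1 => Code_Unit (16#D800# + Shift_Right (Offset, 10)),
                    2 => Code_Unit (16#DC00# + (Offset and 16#3FF#)));
         end;
      end if;
   end Encode;

   --  U+1D11E lies outside the 16-bit range, so it needs two elements.
   G_Clef : constant UTF_16_String := Encode (16#1D11E#);

begin
   Ada.Text_IO.Put_Line
     ("Elements used:" & Integer'Image (G_Clef'Length));
end UTF_16_Demo;
```

The point of the design is that no type here claims to be a "character".
The caller gets a short string back and has to treat it as one, which
keeps the variable width visible instead of hiding it behind a record
with a discriminant.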