From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII
X-Google-Thread: 103376,bcb6f63419c2a56b
X-Google-Attributes: gid103376,public
Path: 
 controlnews3.google.com!news2.google.com!news.maxwell.syr.edu!wn13feed!worldnet.att.net!bgtnsc04-news.ops.worldnet.att.net.POSTED!not-for-mail
From: David Starner <dvdeug@email.ro>
Subject: Re: Supporting full Unicode
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))
Message-Id: <pan.2004.05.12.18.40.52.452322@email.ro>
Newsgroups: comp.lang.ada
References: <9j8oc.16324$V97.13312@newsread1.news.pas.earthlink.net>
 <2004512-94456-948110@foorum.com> <pan.2004.05.12.09.26.57.126499@email.ro>
 <dQmoc.58891$mU6.238072@newsb.telia.net> <2004512-125725-433248@foorum.com>
 <DTqoc.92307$dP1.289702@newsc.telia.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Date: Wed, 12 May 2004 18:55:54 GMT
NNTP-Posting-Host: 12.72.70.249
X-Complaints-To: abuse@worldnet.att.net
X-Trace: bgtnsc04-news.ops.worldnet.att.net 1084388154 12.72.70.249 (Wed,
 12 May 2004 18:55:54 GMT)
NNTP-Posting-Date: Wed, 12 May 2004 18:55:54 GMT
Organization: AT&T Worldnet
Xref: controlnews3.google.com comp.lang.ada:505
Date: 2004-05-12T18:55:54+00:00
List-Id: <comp.lang.ada>

On Wed, 12 May 2004 14:53:23 +0000, Björn Persson wrote:

> Looks troublesome, eh? For UTF-8 I don't think it's even possible to 
> define such a type. I'd rather just define UTF-16 and UTF-8 strings as 
> byte sequences and represent even single characters as strings.

Why would you encode UTF-16 as a byte sequence when you could encode it as
a series of words? You can't use UTF-16 internally as byte sequences
without worrying about byte-order marks, because UTF-16 is inherently
ambiguous as to whether it's big-endian or little-endian. Anything you
defined would be either UTF-16BE or UTF-16LE, and would spend a lot of time
reassembling character pieces on the wrong-endian architecture. UTF-16
should usually be encoded as words.
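
A rough sketch of the point, in Python rather than Ada simply because it is quick to run. The helper name utf16_units and the sample string are made up for illustration. It shows that the 16-bit code units are the same whichever byte order is used to serialize them, while the byte sequences differ and decode to garbage if read with the wrong order.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# 'A' followed by U+1D11E, which needs a surrogate pair in UTF-16.
sample = "A\U0001D11E"

# As words, there is one unambiguous answer: three 16-bit code units.
units = utf16_units(sample)

# As bytes, there are two serializations, and nothing in the bytes
# themselves says which one was used unless a byte-order mark is added.
be_bytes = sample.encode("utf-16-be")
le_bytes = sample.encode("utf-16-le")

# Reading big-endian bytes as little-endian yields unrelated characters.
misread = be_bytes.decode("utf-16-le", errors="replace")

print([hex(u) for u in units])
print(be_bytes.hex(), le_bytes.hex())
print(misread == sample)
```

Stored as words, the byte-order question only comes up at the I/O boundary, which is where a byte-order mark or an explicit UTF-16BE/UTF-16LE label belongs.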

As for characters, they're not much use with Unicode. Even with Latin-1,
you can't uppercase character to character, and any system that does
it is wrong, including Ada. The German eszett (ß) uppercases to SS. You
can't even hold a whole "character" in a character in Unicode, because
you can't fit any attached combining characters in.
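
Both claims are easy to check. Again this is only a sketch in Python, not Ada, and the variable names are mine. Python's str.upper applies the full Unicode case mappings, so it makes a convenient test bed.

```python
import unicodedata

# U+00DF LATIN SMALL LETTER SHARP S is a single Latin-1 character,
# but its uppercase form is two characters long.
eszett = "\u00df"
upper = eszett.upper()

# A base letter plus U+0301 COMBINING ACUTE ACCENT is one "character"
# to a reader, yet it occupies two code points.
accented = "e\u0301"
is_combining = unicodedata.combining(accented[1]) != 0

print(upper, len(eszett), len(upper))
print(len(accented), is_combining)
```

A character-to-character To_Upper has nowhere to put the second S, and a single character slot has nowhere to put the accent, which is why string-to-string operations are the safer interface.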