From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,bcb6f63419c2a56b
X-Google-Attributes: gid103376,public
Path: 
 controlnews3.google.com!news1.google.com!newsfeed.stanford.edu!canoe.uoregon.edu!arclight.uoregon.edu!wn13feed!worldnet.att.net!bgtnsc04-news.ops.worldnet.att.net.POSTED!not-for-mail
From: David Starner <dvdeug@email.ro>
Subject: Re: Supporting full Unicode
User-Agent: Pan/0.14.2.91 (As She Crawled Across the Table (Debian GNU/Linux))
Message-Id: <pan.2004.05.12.19.10.16.123505@email.ro>
Newsgroups: comp.lang.ada
References: <9j8oc.16324$V97.13312@newsread1.news.pas.earthlink.net>
 <2004512-94456-948110@foorum.com>
 <mailman.115.1084354437.313.comp.lang.ada@ada-france.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Date: Wed, 12 May 2004 19:25:20 GMT
NNTP-Posting-Host: 12.72.70.249
X-Complaints-To: abuse@worldnet.att.net
X-Trace: bgtnsc04-news.ops.worldnet.att.net 1084389920 12.72.70.249 (Wed,
 12 May 2004 19:25:20 GMT)
NNTP-Posting-Date: Wed, 12 May 2004 19:25:20 GMT
Organization: AT&T Worldnet
Xref: controlnews3.google.com comp.lang.ada:507
Date: 2004-05-12T19:25:20+00:00
List-Id: <comp.lang.ada>

> Indeed UTF-8 seems to rule. Probably because there are more ready-to-use low
> level tools for 8-bit characters. Actually the proper tools for Unicode
> should be 24-bit based. An ugly fact about Unicode is that the code space is
> 24-bit and the encodings are all but 24 (8, 16, 32).

Why is that ugly? UTF-16 or UTF-8 is virtually always going to be smaller
than a 24-bit encoding, unless most of your text is in an obscure dead
tongue, which is unlikely to be found in quantities that need compression.
A 24-bit encoding is not going to be faster to process either, unless
you're running on some terribly obscure architecture that natively handles
24-bit words.
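
To put rough numbers on the size argument, here is a minimal Python sketch.
The function name encoded_sizes, the sample strings, and the "24-bit"
entry (a hypothetical fixed-width encoding at 3 bytes per code point) are
all assumptions made for illustration; none of them come from the thread.

```python
def encoded_sizes(text):
    """Return the byte count of text under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
        "24-bit": 3 * len(text),  # hypothetical: 3 bytes per code point
    }

# Illustrative samples only, written with escapes to stay ASCII-safe.
SAMPLES = {
    "english": "Hello, world",
    # Greek letters in the Basic Multilingual Plane
    "greek": "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
    # Gothic letters, which live outside the BMP (U+10330..U+1034F)
    "gothic": "\U00010332\U0001033f\U00010344\U00010339\U00010343\U0001033a",
}

for name, text in SAMPLES.items():
    print(name, encoded_sizes(text))
```

Only the Gothic sample, made up entirely of supplementary-plane characters,
comes out smaller at 3 bytes per code point; the English and Greek samples
are smaller in both UTF-8 and UTF-16.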

As someone else pointed out, the code space is not 24 bits, it's roughly
20.1 bits.
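
The 20.1 figure can be checked with a quick calculation. This sketch
assumes the Unicode code space runs from U+0000 to U+10FFFF, i.e. 0x110000
code points; the names CODE_POINTS and bits_needed are mine.

```python
import math

# Assumption: the code space is U+0000 .. U+10FFFF inclusive.
CODE_POINTS = 0x110000

def bits_needed(count):
    """Return log2(count): the bits required to number count distinct values."""
    return math.log2(count)

bits = bits_needed(CODE_POINTS)
print(f"{CODE_POINTS} code points need about {bits:.2f} bits")
print(f"whole bits required: {math.ceil(bits)}")
```

Since 0x110000 is 17 * 2**16, the result is 16 + log2(17), which is just
under 20.1, so 21 whole bits are enough for any code point.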

As for compression, someone compared compression formats across various
Unicode encodings[1] and found that compression wiped out most of the
size difference between the encodings.
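
For anyone who wants to try this on their own text, here is a rough Python
sketch. It does not reproduce the comparison in [1]: it uses zlib from the
Python standard library, and the repeated English sentence is a made-up,
highly compressible sample, so treat the numbers as an illustration only.
The names ENCODINGS, raw_and_compressed and SAMPLE are assumptions.

```python
import zlib

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def raw_and_compressed(text):
    """Map each encoding to (raw byte count, zlib-compressed byte count)."""
    result = {}
    for enc in ENCODINGS:
        data = text.encode(enc)
        result[enc] = (len(data), len(zlib.compress(data, 9)))
    return result

# Made-up sample: very repetitive, so it compresses far better than real text.
SAMPLE = "The quick brown fox jumps over the lazy dog. " * 200

for enc, (raw, packed) in raw_and_compressed(SAMPLE).items():
    print(f"{enc:10} raw={raw:6} compressed={packed:6}")
```

Substituting a real corpus for SAMPLE gives a fairer picture of how much of
the raw size gap survives compression.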

[1] http://www.cs.fit.edu/~ryan/compress/