From: David Starner
Subject: Re: Supporting full Unicode
Newsgroups: comp.lang.ada
Date: Wed, 12 May 2004 19:25:20 GMT

> Indeed UTF-8 seems to rule. Probably because there are more ready-to-use low
> level tools for 8-bit characters. Actually the proper tools for Unicode
> should be 24-bit based. An ugly fact about Unicode is that the code space is
> 24-bit and the encodings are all but 24 (8, 16, 32).

Why is that ugly? UTF-16 or UTF-8 is virtually always going to be smaller
than a 24-bit encoding, unless most of your text is in an obscure dead
tongue, which is unlikely to be found in quantities that need compression.
A 24-bit encoding is not going to be faster to process either, unless
you're running on some terribly obscure architecture that natively handles
24-bit words. And as someone else pointed out, the code space is not 24
bits, it's roughly 20.1.

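To put rough numbers on the size argument, here is a minimal Python sketch. The "fixed-24" row is a hypothetical encoding of 3 bytes per code point, assumed only for comparison (it is not a real Unicode encoding form), and the sample strings and function names are my own. The 20.1 figure falls out of the code space running from U+0000 to U+10FFFF.

```python
import math

# Unicode code points run from U+0000 to U+10FFFF inclusive.
CODE_SPACE = 0x110000


def bits_needed(n_values: int) -> float:
    """Bits required to distinguish n_values different values."""
    return math.log2(n_values)


def encoded_sizes(text: str) -> dict:
    """Size in bytes of text under each encoding (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        # Hypothetical fixed-width form: 3 bytes for every code point.
        "fixed-24": 3 * len(text),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    print(f"bits for the code space: {bits_needed(CODE_SPACE):.3f}")
    samples = {
        "English": "Hello",
        "Japanese": "\u65e5\u672c\u8a9e",
        # Gothic letters live outside the Basic Multilingual Plane.
        "Gothic": "\U00010332\U0001033F\U00010344",
    }
    for name, text in samples.items():
        print(name, encoded_sizes(text))
```

Only the Gothic sample, which sits entirely outside the Basic Multilingual Plane, comes out smaller in the fixed 24-bit form. The other two are no larger in UTF-8 and strictly smaller in at least one of UTF-8 or UTF-16.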
As for compression, a comparison of compression formats on various Unicode
encodings was made[1], and it found that compression wiped out most of the
difference between encodings.

[1] http://www.cs.fit.edu/~ryan/compress/
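
Anyone who wants to try the compression point on their own text could start from something like the sketch below. It is not the method used in the comparison cited above, just a small experiment with Python's standard zlib module. The function name and the sample string are my own, and a short repeated sentence is far more compressible than real prose, so feed it a real document before drawing conclusions.

```python
import zlib

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def compare(text: str, level: int = 9) -> dict:
    """Map each encoding name to (raw_bytes, compressed_bytes) for text."""
    result = {}
    for name in ENCODINGS:
        raw = text.encode(name)
        result[name] = (len(raw), len(zlib.compress(raw, level)))
    return result


if __name__ == "__main__":
    # Toy input only: substitute the contents of a real file here.
    sample = "Indeed UTF-8 seems to rule. " * 200
    for name, (raw, packed) in compare(sample).items():
        print(f"{name:10} raw={raw:6d} compressed={packed:6d}")
```

The numbers to look at are the raw sizes, which differ by fixed factors for ASCII text, against the compressed sizes printed beside them.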