From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,43ab55a75a8b5d1
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII
Path: 
 g2news2.google.com!news2.google.com!news.germany.com!news.belwue.de!kanaga.switch.ch!switch.ch!news.tele.dk!news.tele.dk!small.news.tele.dk!newspeer1.se.telia.net!se.telia.net!masternews.telia.net.!newsb.telia.net.POSTED!not-for-mail
From: =?ISO-8859-15?Q?Bj=F6rn_Persson?= <spam-away@nowhere.nil>
User-Agent: Thunderbird 1.5.0.4 (X11/20060614)
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: System.WCh_Cnv
References: <e9302j$f70$1@news521.nifty.com>
 <3082414.k9Jeq3hKxq@linux1.krischik.com>
 <1152811469.003475.301520@s13g2000cwa.googlegroups.com>
 <HXytg.8972$E02.2845@newsb.telia.net>
 <1152862832.649761.205770@75g2000cwc.googlegroups.com>
 <HYLtg.9040$E02.2744@newsb.telia.net>
 <57677082.XS7luc1THj@linux1.krischik.com>
In-Reply-To: <57677082.XS7luc1THj@linux1.krischik.com>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 8bit
Message-ID: <Sxaxg.10370$E02.3445@newsb.telia.net>
Date: Mon, 24 Jul 2006 21:00:34 GMT
NNTP-Posting-Host: 83.250.106.238
X-Complaints-To: abuse@telia.com
X-Trace: newsb.telia.net 1153774834 83.250.106.238 (Mon,
 24 Jul 2006 23:00:34 CEST)
NNTP-Posting-Date: Mon, 24 Jul 2006 23:00:34 CEST
Organization: Telia Internet
Xref: g2news2.google.com comp.lang.ada:5905
Date: 2006-07-24T21:00:34+00:00
List-Id: <comp.lang.ada>

Martin Krischik wrote:
> Björn Persson wrote:
> 
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
> 
> A few years ago you could have said the same replacing all '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> unicode authority(s) defined the 65537'th character...
> 
> I know that currently only 21 bits are actually used and the unicode
> authority(s) have given up on using more codepoints. Still I am unsure of
> just declaring them both the same.

How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.
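
The hypothetical case can be checked mechanically. Here is a minimal
sketch in Python, assuming a code unit is modelled as a big-endian
unsigned 32-bit integer via the standard struct module. The helper name
fits_in_one_code_unit is made up for this illustration and is not part
of any standard.

```python
import struct


def fits_in_one_code_unit(value: int) -> bool:
    """Return True if value can be stored in one unsigned 32-bit code unit."""
    try:
        # ">I" means: big-endian, unsigned, exactly 32 bits wide.
        struct.pack(">I", value)
        return True
    except struct.error:
        # Raised when value is outside the range 0 .. 2**32 - 1.
        return False


# The highest code point Unicode currently defines fits easily.
highest_defined = fits_in_one_code_unit(0x10FFFF)

# The hypothetical code point 100000000 hexadecimal needs 33 bits.
hypothetical = fits_in_one_code_unit(0x100000000)
```

With a single 32-bit code unit there is simply no room for the value
100000000 hexadecimal, so a fixed-width UTF-32 could not represent it.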

I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says:

"UTF-32 encoding form: The Unicode encoding form which assigns each
Unicode scalar value to a single unsigned 32-bit code unit with the same
numeric value as the Unicode scalar value."

Note "single".
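
The quoted definition can be illustrated with a short Python sketch. It
assumes Python's built-in "utf-32-be" codec follows that definition; the
function name utf32be_code_units and the sample string are my own
choices for the example.

```python
import struct


def utf32be_code_units(text: str) -> list:
    """Split the UTF-32BE encoding of text into its 32-bit code units."""
    data = text.encode("utf-32-be")  # this codec variant writes no byte order mark
    count = len(data) // 4           # every code unit is four bytes wide
    return list(struct.unpack(f">{count}I", data))


# Sample scalar values from inside and outside the Basic Multilingual Plane.
sample = "A\u00f6\u20ac\U00010000\U0010ffff"
units = utf32be_code_units(sample)

# One code unit per character, each numerically equal to the scalar value.
same_values = units == [ord(ch) for ch in sample]
```

For this sample, each character maps to exactly one code unit whose
numeric value is the scalar value itself, which is what the definition
says.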

Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".
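
The difference between the two categories can be shown by counting code
units per character. This is only a sketch, assuming Python's
"utf-16-be" and "utf-32-be" codecs; the helper name code_unit_counts is
hypothetical.

```python
def code_unit_counts(ch: str) -> tuple:
    """Return (UTF-16 code units, UTF-32 code units) needed for one character."""
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 16-bit code units
    utf32_units = len(ch.encode("utf-32-be")) // 4  # 32-bit code units
    return utf16_units, utf32_units


bmp_char = code_unit_counts("A")                 # inside the Basic Multilingual Plane
first_beyond = code_unit_counts("\U00010000")    # first code point past U+FFFF
last_defined = code_unit_counts("\U0010ffff")    # highest Unicode scalar value
```

UTF-16 needs a surrogate pair, that is two code units, for anything
above U+FFFF, while UTF-32 uses one code unit throughout the whole
range of scalar values.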

Of course, should the Unicode Consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu