From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: System.WCh_Cnv
Date: Mon, 24 Jul 2006 21:00:34 GMT
Message-ID: <Sxaxg.10370$E02.3445@newsb.telia.net>
In-Reply-To: <57677082.XS7luc1THj@linux1.krischik.com>
Martin Krischik wrote:
> Björn Persson wrote:
>
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
>
> A few years ago you could have said the same, replacing all '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> Unicode authority(s) defined the 65537th character...
>
> I know that currently only 21 bits are actually used and the Unicode
> authority(s) have given up on using more code points. Still, I am unsure
> about just declaring them both the same.
How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.
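
To make the hypothetical concrete, here is a minimal Python sketch. It assumes only that a UTF-32 code unit is one unsigned 32-bit integer, as in the definition quoted further down. The helper name fits_in_one_code_unit is made up for this illustration, and 0x100000000 is the invented code point from above, not a real character.

```python
import struct

# Hypothetical code point from the discussion; no such character exists.
HYPOTHETICAL_CODE_POINT = 0x100000000

def fits_in_one_code_unit(value: int) -> bool:
    """Return True if value can be stored in one unsigned 32-bit code unit."""
    try:
        struct.pack(">I", value)  # ">I" = big-endian unsigned 32-bit integer
        return True
    except struct.error:
        return False

# 0x10FFFF is the largest code point Unicode currently allows.
print(fits_in_one_code_unit(0x10FFFF))                 # True
print(HYPOTHETICAL_CODE_POINT.bit_length())            # 33
print(fits_in_one_code_unit(HYPOTHETICAL_CODE_POINT))  # False
```

The value needs 33 bits, so it cannot be "a single unsigned 32-bit code unit with the same numeric value", and UTF-32 offers no multi-unit escape mechanism for it.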
I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says:
"UTF-32 encoding form: The Unicode encoding form which assigns each
Unicode scalar value to a single unsigned 32-bit code unit with the same
numeric value as the Unicode scalar value."
Note "single".
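
The quoted definition is short enough to transcribe almost word for word. The following Python sketch is my own illustration, not code from the standard: the names is_scalar_value and encode_utf32be are invented here, and the big-endian byte order is an assumption made so the result can be compared with Python's built-in "utf-32-be" codec.

```python
import struct

def is_scalar_value(cp: int) -> bool:
    """Unicode scalar values are 0..0x10FFFF, excluding the surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def encode_utf32be(text: str) -> bytes:
    """Map each scalar value to one 32-bit code unit with the same value."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if not is_scalar_value(cp):
            raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
        out += struct.pack(">I", cp)  # exactly one code unit per character
    return bytes(out)

sample = "A\u00f6\u20ac\U00010000"  # includes one character beyond U+FFFF
encoded = encode_utf32be(sample)

# Matches the built-in codec, and is always four bytes per character.
assert encoded == sample.encode("utf-32-be")
assert len(encoded) == 4 * len(sample)
```

There is no branch in the loop that could emit a second code unit, which is exactly what "single" implies.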
Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".
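
The difference between the two categories is easy to observe. This small Python sketch uses the standard "utf-16-be" and "utf-32-be" codecs; the helper name code_units is made up for the example.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count the code units used to encode text (unit_size is in bytes)."""
    return len(text.encode(encoding)) // unit_size

for ch in ("A", "\u20ac", "\U00010000"):
    utf16 = code_units(ch, "utf-16-be", 2)
    utf32 = code_units(ch, "utf-32-be", 4)
    print(f"U+{ord(ch):04X}: UTF-16 units = {utf16}, UTF-32 units = {utf32}")

# U+0041: UTF-16 units = 1, UTF-32 units = 1
# U+20AC: UTF-16 units = 1, UTF-32 units = 1
# U+10000: UTF-16 units = 2, UTF-32 units = 1
```

U+10000, the "65537th character", is where UTF-16 starts needing a surrogate pair and so stops matching UCS-2, while UTF-32 still uses one code unit.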
Of course, should the Unicode Consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.
--
Björn Persson PGP key A88682FD
omb jor ers @sv ge.
r o.b n.p son eri nu