comp.lang.ada
From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: System.WCh_Cnv
Date: Mon, 24 Jul 2006 21:00:34 GMT
Message-ID: <Sxaxg.10370$E02.3445@newsb.telia.net>
In-Reply-To: <57677082.XS7luc1THj@linux1.krischik.com>

Martin Krischik wrote:
> Björn Persson wrote:
> 
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
> 
> A few years ago you could have said the same replacing all '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> unicode authority(s) defined the 65537'th character...
> 
> I know that currently only 21 bits are actually used and the unicode
> authority(s) have given up on using more codepoints. Still I am unsure of
> just declaring them both the same.

How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.
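
To make the hypothetical concrete, here is a rough Python sketch (my own
illustration, not anything from the standard; the helper name
fits_in_one_code_unit is made up). It only checks whether a number can be
stored in one unsigned 32-bit code unit, which is the form UTF-32 uses:

```python
import struct

def fits_in_one_code_unit(value):
    """Return True if value can be stored in one unsigned 32-bit code unit."""
    try:
        struct.pack(">I", value)  # big-endian unsigned 32-bit integer
        return True
    except struct.error:
        # struct refuses values outside 0 .. FFFFFFFF hexadecimal
        return False

largest_unicode = 0x10FFFF   # highest code point Unicode currently allows
hypothetical = 0x100000000   # the 33-bit value from the paragraph above

print(fits_in_one_code_unit(largest_unicode))
print(fits_in_one_code_unit(hypothetical))
```

The hypothetical value needs 33 bits, so there is no single 32-bit code
unit that could hold it.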

I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says:

"UTF-32 encoding form: The Unicode encoding form which assigns each
Unicode scalar value to a single unsigned 32-bit code unit with the same
numeric value as the Unicode scalar value."

Note "single".

Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".
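
The difference can be seen with a small Python sketch (an illustration
only; the function name code_unit_counts is hypothetical, and I am
assuming Python's built-in "utf-16-be" and "utf-32-be" codecs, which
write no byte order mark):

```python
def code_unit_counts(text):
    """Return (UTF-16 code units, UTF-32 code units) needed for text."""
    utf16_units = len(text.encode("utf-16-be")) // 2  # 2 bytes per unit
    utf32_units = len(text.encode("utf-32-be")) // 4  # 4 bytes per unit
    return utf16_units, utf32_units

bmp_char = "\u00f6"          # U+00F6, inside the Basic Multilingual Plane
beyond_bmp = "\U00010000"    # U+10000, the first code point beyond it

print(code_unit_counts(bmp_char))    # (1, 1)
print(code_unit_counts(beyond_bmp))  # (2, 1)

# In UTF-32 the single code unit has the same numeric value as the
# scalar value it encodes.
utf32_unit = int.from_bytes(beyond_bmp.encode("utf-32-be"), "big")
print(hex(utf32_unit))               # 0x10000
```

UTF-16 needs a surrogate pair for U+10000, while UTF-32 still uses
exactly one code unit.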

Of course, should the Unicode consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu


