From: Björn Persson
Newsgroups: comp.lang.ada
Subject: Re: System.WCh_Cnv
Date: Mon, 24 Jul 2006 21:00:34 GMT

Martin Krischik wrote:
> Björn Persson wrote:
>
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
>
> A few years ago you could have said the same replacing all '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> unicode authority(s) defined the 65537'th character...
>
> I know that currently only 21 bits are actually used and the unicode
> authority(s) have given up on using more codepoints. Still I am unsure of
> just declaring them both the same.

How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.

I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says:

"UTF-32 encoding form: The Unicode encoding form which assigns each
Unicode scalar value to a single unsigned 32-bit code unit with the same
numeric value as the Unicode scalar value."

Note "single".

Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".

Of course, should the Unicode Consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.

-- 
Björn Persson PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu
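
To make the fixed-width versus variable-width distinction concrete, here is a minimal Python sketch. It is only an illustration, not part of the Unicode definition or of System.WCh_Cnv: the helper names utf32be_code_units, utf16be_code_units and fits_in_one_utf32_unit are hypothetical, and it assumes that Python's built-in "utf-32-be" and "utf-16-be" codecs follow the Unicode encoding forms quoted above.

```python
import struct


def utf32be_code_units(text):
    """Return the 32-bit code units of text encoded as UTF-32BE."""
    data = text.encode("utf-32-be")
    return [struct.unpack(">I", data[i:i + 4])[0] for i in range(0, len(data), 4)]


def utf16be_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16BE."""
    data = text.encode("utf-16-be")
    return [struct.unpack(">H", data[i:i + 2])[0] for i in range(0, len(data), 2)]


def fits_in_one_utf32_unit(value):
    """True if value can be stored in a single unsigned 32-bit code unit."""
    return 0 <= value <= 0xFFFFFFFF


# U+0041, U+00F6 and U+1D11E; the last one lies beyond the 16-bit range.
sample = "A\u00f6\U0001D11E"

# UTF-32: exactly one code unit per character, equal to the scalar value.
assert utf32be_code_units(sample) == [ord(ch) for ch in sample]

# UTF-16: U+1D11E needs two code units (a surrogate pair).
assert utf16be_code_units(sample) == [0x41, 0xF6, 0xD834, 0xDD1E]

# The hypothetical code point 100000000 hexadecimal needs 33 bits,
# so a single 32-bit code unit cannot hold it.
assert (0x100000000).bit_length() == 33
assert not fits_in_one_utf32_unit(0x100000000)
```

The UTF-16 case is what broke the old UCS-2 assumption: one character became two code units. The last two assertions show that the same could not happen to UTF-32 without changing the definition of its code unit.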