From: Björn Persson
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 in strings - a bug?
Date: Sat, 08 May 2004 11:06:40 GMT
Message-ID: <4b3nc.58514$mU6.237399@newsb.telia.net>
References: <200456-112553-85684@foorum.com> <2178612.8V5KANFFf5@linux1.krischik.com>

Jacob Sparre Andersen wrote:

> Your quotes (which may be unfair :-)

Sorry, I should have provided more context.
Here's the relevant part of unicode/unicode.ads in XML/Ada version 1.0
from ACT-Europe, so you don't have to download the library just to see
what I'm talking about:

   --  Coded character sets (packages Unicode.CCS.*)
   --  ====================
   --  Mapping from a set of abstract characters to the set of non-negative
   --  integers
   --  The integer associated with a character is called "code point", and the
   --  character is called "encoded character"
   --  Examples of these are: ISO/8859-1, JIS X 0208, ...
   --
   --  Character naming (packages Unicode.Names.*)
   --  ================
   --  A unique name is assigned to each abstract character, so that it is
   --  possible to get the same character no matter what repertoire is used.
   --
   --  Character Encoding Forms
   --  ========================
   --  Mapping from the set of integers used in a Coded Character Set to the set
   --  of sequences of code units.
   --  A "code unit" is integer occupying a specified binary width in a computer
   --  architecture
   --  Examples of fixed-width encoding forms: 7-bit, 8-bit, EBCDIC
   --  Examples of variable-width encoding forms: Utf-8, Utf-16,...
   --
   --  Character Encoding Scheme (packages Unicode.CES.*)
   --  =========================
   --  Mapping of code units into serialized byte sequences. It also takes into
   --  account the byte-order serialization.
   --  As a summary, converting a file containing latin-1 characters coded on
   --  8 bits to a Utf8 latin2 file, the following steps are involved:
   --
   --     Latin1 string  (contains bytes associated with code points in Latin1)
   --       |    "use Unicode.CES.Basic_8bit.To_Utf32"
   --       v
   --     Utf32 latin1 string (contains code points in Latin1)
   --       |    "Convert argument to To_Utf32 should be
   --       v         Unicode.CCS.Iso_8859_1.Convert"
   --     Utf32 Unicode string (contains code points in Unicode)
   --       |    "use Unicode.CES.Utf8.From_Utf32"
   --       v
   --     Utf8 Unicode string (contains code points in Unicode)
   --       |    "Convert argument to From_Utf32 should be
   --       v         Unicode.CCS.Iso_8859_2.Convert"
   --     Utf8 Latin2 string (contains code points in Latin2)

Investigating further, I see that docs/xml_2.html shows the exact same
example of converting Latin-1 to "Utf8 Latin2".

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu
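For readers without XML/Ada at hand, the layering in the quoted comment can be sketched with Python's standard codecs. This is only an illustration under stated assumptions: none of the function names below come from XML/Ada, and the last function is just one literal reading of what a "Utf8 Latin2" string might contain (Latin-2 code points written out with the UTF-8 encoding scheme), not a statement of what the library actually does.

```python
# Illustrative sketch only: plain Python codecs, not the XML/Ada API.
# All function names are hypothetical.

def decode_8bit(data):
    """Encoding-scheme step: one byte yields one integer code point."""
    return list(data)

def latin1_to_unicode(code_points):
    """Character-set step: ISO 8859-1 code points 0..255 have the same
    values as the first 256 Unicode code points, so this is the identity."""
    return list(code_points)

def encode_utf8(code_points):
    """Encoding-scheme step: serialize integer code points as UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

def latin1_bytes_to_utf8(data):
    """First three arrows of the quoted diagram, chained together."""
    return encode_utf8(latin1_to_unicode(decode_8bit(data)))

def utf8_of_latin2_code_points(text):
    """One literal reading of "Utf8 Latin2" (an assumption): map the text
    to ISO 8859-2 code points, then write those integers out as UTF-8."""
    latin2_code_points = list(text.encode("iso8859_2"))
    return encode_utf8(latin2_code_points)

sample = "Björn".encode("latin-1")       # 8-bit Latin-1 input
result = latin1_bytes_to_utf8(sample)    # the same text as UTF-8 bytes
odd = utf8_of_latin2_code_points("ł")    # Latin-2 code point 0xB3 as UTF-8
```

Under that reading, a reader who assumes ordinary UTF-8 would decode `odd` as U+00B3 rather than as "ł", which is why the term looks suspicious to me.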