From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII
X-Google-Thread: 103376,1086bab45b40d4b0
X-Google-Attributes: gid103376,public
Path: 
 controlnews3.google.com!news2.google.com!news.maxwell.syr.edu!newsfeed.icl.net!newsfeed.fjserv.net!newsfeed.freenet.de!newsfeed00.sul.t-online.de!newsmm00.sul.t-online.de!t-online.de!news.t-online.com!not-for-mail
From: Martin Krischik <krischik@users.sourceforge.net>
Newsgroups: comp.lang.ada
Subject: Re: UTF-8 in strings - a bug?
Date: Sat, 08 May 2004 18:25:19 +0200
Organization: AdaCL
Message-ID: <3171026.RJblE7u9LK@linux1.krischik.com>
References: <TEdmc.58085$mU6.237063@newsb.telia.net>
 <WJOdndbsxKPZ5ATdRVn-iQ@comcast.com> <lMmmc.58280$mU6.237078@newsb.telia.net>
 <200456-112553-85684@foorum.com> <2178612.8V5KANFFf5@linux1.krischik.com>
 <q0Vmc.58459$mU6.237464@newsb.telia.net> <pld65f8isk.fsf@sparre.crs4.it>
 <4b3nc.58514$mU6.237399@newsb.telia.net>
Reply-To: krischik@users.sourceforge.net
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 8Bit
X-Trace: news.t-online.com 1084033593 06 29191 Rzz0G1NJqsyeuay 040508 16:26:33
X-Complaints-To: usenet-abuse@t-online.de
X-ID: EkF4osZ1oeFHvbqZzjaIeOXTIaS28rDiKUWqH0sIAb-MNsa3xlS+4A
User-Agent: KNode/0.7.7
Xref: controlnews3.google.com comp.lang.ada:391
Date: 2004-05-08T18:25:19+02:00
List-Id: <comp.lang.ada>

Björn Persson wrote:


> --     Latin1 string  (contains bytes associated with code points in Latin1)
> --       |    "use Unicode.CES.Basic_8bit.To_Utf32"
> --       v

Basic_8bit.To_Utf32 only performs an 8-bit -> 32-bit expansion, that is,
16#xx# becomes 16#000000xx#. The result is not really Unicode yet, but it is
needed for the further conversions.
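
To illustrate what such an expansion amounts to, here is a minimal,
self-contained sketch. It does not use XMLAda at all: Code_Unit_32,
Code_Unit_Array and Expand are hypothetical names made up for this example
and say nothing about how the library implements the step.

```ada
with Ada.Text_IO;

procedure Expand_Demo is

   --  Hypothetical stand-ins for this sketch; these are not XMLAda types.
   type Code_Unit_32 is mod 2 ** 32;
   type Code_Unit_Array is array (Positive range <>) of Code_Unit_32;

   --  Widen every 8-bit value to 32 bits: 16#xx# becomes 16#000000xx#.
   --  No character-set mapping happens here.
   function Expand (Source : String) return Code_Unit_Array is
      Result : Code_Unit_Array (Source'Range);
   begin
      for I in Source'Range loop
         Result (I) := Character'Pos (Source (I));
      end loop;
      return Result;
   end Expand;

   --  "A" followed by the byte 16#E9#.
   Wide : constant Code_Unit_Array :=
     Expand ("A" & Character'Val (16#E9#));

begin
   for I in Wide'Range loop
      Ada.Text_IO.Put_Line (Code_Unit_32'Image (Wide (I)));
   end loop;
end Expand_Demo;
```

The numeric values are unchanged (65 and 233 here); only the width of each
element grows, which is why the result still has to go through a character
set conversion before it can be called Unicode.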

> --     Utf32 latin1 string (contains code points in Latin1)
> --       |    "Convert argument to To_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_1.Convert"

This does the actual conversion. The result is now Unicode.

> --     Utf32 Unicode string (contains code points in Unicode)
> --       |    "use Unicode.CES.Utf8.From_Utf32"
> --       v

Now we have standard UTF-8.
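
For readers who want to see what the UTF-8 step does to a single code point,
here is a rough sketch of the encoding rule. It is independent of XMLAda;
Code_Point, Encode and Tail are names invented for this example, and the
sketch does not reject surrogate values.

```ada
with Ada.Text_IO;

procedure Utf8_Demo is

   --  Hypothetical type for this sketch; not the XMLAda declaration.
   type Code_Point is range 0 .. 16#10FFFF#;

   --  Encode one code point as a UTF-8 byte sequence (1 to 4 bytes).
   --  Surrogates (16#D800# .. 16#DFFF#) are not rejected in this sketch.
   function Encode (C : Code_Point) return String is

      --  Continuation byte 10xxxxxx holding the low six bits of Bits.
      function Tail (Bits : Code_Point) return Character is
      begin
         return Character'Val (16#80# + Bits mod 64);
      end Tail;

   begin
      if C < 16#80# then
         return (1 => Character'Val (C));
      elsif C < 16#800# then
         return Character'Val (16#C0# + C / 64) & Tail (C);
      elsif C < 16#1_0000# then
         return Character'Val (16#E0# + C / 4096)
           & Tail (C / 64) & Tail (C);
      else
         return Character'Val (16#F0# + C / 262144)
           & Tail (C / 4096) & Tail (C / 64) & Tail (C);
      end if;
   end Encode;

   --  16#E9# is "e with acute accent" in both Latin-1 and Unicode.
   Encoded : constant String := Encode (16#E9#);

begin
   --  Print the resulting bytes as decimal numbers.
   for I in Encoded'Range loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (Encoded (I))));
   end loop;
   Ada.Text_IO.New_Line;
end Utf8_Demo;
```

For 16#E9# this yields the two bytes 16#C3# 16#A9# (195 and 169), so a single
Latin-1 byte above 16#7F# becomes two bytes in UTF-8.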

> --     Utf8 Unicode string (contains code points in Unicode)
> --       |    "Convert argument to From_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_2.Convert"

Now this is some Latin-2 optimized UTF-8. Whether this is truly useful I
don't know.

> --     Utf8 Latin2 string (contains code points in Latin2)
> 
> 
> Investigating further, I see that docs/xml_2.html shows the exact same
> example of converting Latin-1 to "Utf8 Latin2".

The UTF-X encodings can start with a BOM (byte-order mark). This changes the
behaviour of the encoding:

   ------------------------------
   -- Byte-order mark handling --
   ------------------------------

   type Bom_Type is
     (Utf8_All,  --  Utf8-encoding
      Utf16_LE,  --  Utf16 little-endian encoding
      Utf16_BE,  --  Utf16 big-endian encoding
      Utf32_LE,  --  Utf32 little-endian encoding
      Utf32_BE,  --  Utf32 big-endian encoding
      Ucs4_BE,   --  UCS-4, big endian machine (1234 order)
      Ucs4_LE,   --  UCS-4, little endian machine (4321 order)
      Ucs4_2143, --  UCS-4, unusual byte order (2143 order)
      Ucs4_3412, --  UCS-4, unusual byte order (3412 order)
      Unknown);  --  Unknown, assumed to be ASCII compatible
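
As a rough illustration of how the leading bytes can be mapped to such a
value, here is a self-contained sketch. It is not the XMLAda code: Bom_Kind
is a reduced copy of the enumeration above, Detect, Starts_With and B are
hypothetical helpers, and the UCS-4 cases are left out.

```ada
with Ada.Text_IO;

procedure Bom_Demo is

   --  Reduced copy of the enumeration above, for this sketch only.
   type Bom_Kind is
     (Utf8_All, Utf16_LE, Utf16_BE, Utf32_LE, Utf32_BE, Unknown);

   --  True if Data begins with the bytes given in Prefix.
   function Starts_With (Data, Prefix : String) return Boolean is
   begin
      return Data'Length >= Prefix'Length
        and then Data (Data'First .. Data'First + Prefix'Length - 1)
                   = Prefix;
   end Starts_With;

   --  Shorthand: the byte with the given value.
   function B (Value : Natural) return Character is
   begin
      return Character'Val (Value);
   end B;

   function Detect (Data : String) return Bom_Kind is
   begin
      --  The 4-byte marks are tested first, because the UTF-32 LE mark
      --  starts with the same two bytes as the UTF-16 LE mark.
      if Starts_With (Data, B (16#FF#) & B (16#FE#) & B (0) & B (0)) then
         return Utf32_LE;
      elsif Starts_With (Data, B (0) & B (0) & B (16#FE#) & B (16#FF#)) then
         return Utf32_BE;
      elsif Starts_With (Data, B (16#EF#) & B (16#BB#) & B (16#BF#)) then
         return Utf8_All;
      elsif Starts_With (Data, B (16#FF#) & B (16#FE#)) then
         return Utf16_LE;
      elsif Starts_With (Data, B (16#FE#) & B (16#FF#)) then
         return Utf16_BE;
      else
         return Unknown;
      end if;
   end Detect;

begin
   --  A UTF-8 mark followed by some ASCII text.
   Ada.Text_IO.Put_Line
     (Bom_Kind'Image
        (Detect (B (16#EF#) & B (16#BB#) & B (16#BF#) & "<a/>")));
end Bom_Demo;
```

This prints UTF8_ALL. Input without any mark ends up as Unknown, which
matches the "assumed to be ASCII compatible" comment in the type above.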

BTW: I am currently adding Wide_Character support to the XMLAda/Unicode
package.

With Regards

Martin

-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com