comp.lang.ada
From: Martin Krischik <krischik@users.sourceforge.net>
Subject: Re: UTF-8 in strings - a bug?
Date: Sat, 08 May 2004 18:25:19 +0200
Message-ID: <3171026.RJblE7u9LK@linux1.krischik.com>
In-Reply-To: 4b3nc.58514$mU6.237399@newsb.telia.net

Björn Persson wrote:


> --     Latin1 string  (contains bytes associated with code points in
> Latin1)
> --       |    "use Unicode.CES.Basic_8bit.To_Utf32"
> --       v

Basic_8bit.To_Utf32 only performs an 8-bit -> 32-bit expansion, that is,
16#xx# becomes 16#000000xx#. The result is not really Unicode yet, but it is
needed for the further conversions.
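
A minimal sketch of that expansion, using only the standard Interfaces
package rather than XML/Ada itself. Utf32_Array and Expand are made-up names
for illustration, not part of the Unicode module:

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Expand_Demo is
   --  Hypothetical stand-in for a UTF-32 string: one 32-bit value
   --  per character.
   type Utf32_Array is array (Positive range <>) of Unsigned_32;

   --  Zero-extend every 8-bit character: 16#xx# -> 16#000000xx#.
   --  No character set conversion happens here.
   function Expand (Source : String) return Utf32_Array is
      Result : Utf32_Array (Source'Range);
   begin
      for I in Source'Range loop
         Result (I) := Unsigned_32 (Character'Pos (Source (I)));
      end loop;
      return Result;
   end Expand;

   --  "A" followed by the byte 16#E9#.
   Wide : constant Utf32_Array := Expand ("A" & Character'Val (16#E9#));
begin
   for I in Wide'Range loop
      Ada.Text_IO.Put_Line (Unsigned_32'Image (Wide (I)));
   end loop;
end Expand_Demo;
```

The values are the same numbers as before, only wider. What they mean still
depends on the character set of the input.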

> --     Utf32 latin1 string (contains code points in Latin1)
> --       |    "Convert argument to To_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_1.Convert"

This does the actual conversion. The result is now Unicode.

> --     Utf32 Unicode string (contains code points in Unicode)
> --       |    "use Unicode.CES.Utf8.From_Utf32"
> --       v

Now we have standard UTF-8.
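
To make this step concrete, here is a rough sketch of how one Unicode code
point is packed into UTF-8 bytes. It is written against the standard
Interfaces package only. Encode, Lead and Tail are names I made up, and
this is not how Unicode.CES.Utf8 is implemented:

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure Utf8_Demo is
   --  Encode one code point (0 .. 16#10FFFF#) as 1 to 4 UTF-8 bytes.
   --  Surrogate code points are not rejected in this sketch.
   function Encode (Code : Unsigned_32) return String is

      --  First byte: length prefix plus the topmost payload bits.
      function Lead (Prefix : Unsigned_32; Shift : Natural)
        return Character
      is
         Bits : constant Unsigned_32 := Prefix or Shift_Right (Code, Shift);
      begin
         return Character'Val (Bits);
      end Lead;

      --  Continuation byte 2#10xxxxxx# carrying six payload bits.
      function Tail (Shift : Natural) return Character is
         Bits : constant Unsigned_32 :=
           16#80# or (Shift_Right (Code, Shift) and 16#3F#);
      begin
         return Character'Val (Bits);
      end Tail;

   begin
      if Code > 16#10_FFFF# then
         raise Constraint_Error;
      elsif Code < 16#80# then
         return (1 => Character'Val (Code));
      elsif Code < 16#800# then
         return (Lead (16#C0#, 6), Tail (0));
      elsif Code < 16#1_0000# then
         return (Lead (16#E0#, 12), Tail (6), Tail (0));
      else
         return (Lead (16#F0#, 18), Tail (12), Tail (6), Tail (0));
      end if;
   end Encode;

   --  U+00E9 (e with acute) needs two bytes: 16#C3# 16#A9#.
   Bytes : constant String := Encode (16#E9#);
begin
   for I in Bytes'Range loop
      Ada.Text_IO.Put_Line (Integer'Image (Character'Pos (Bytes (I))));
   end loop;
end Utf8_Demo;
```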

> --     Utf8 Unicode string (contains code points in Unicode)
> --       |    "Convert argument to From_Utf32 should be
> --       v         Unicode.CCS.Iso_8859_2.Convert"

Now this is some Latin-2 optimized UTF-8. Whether this is truly useful, I
don't know.

> --     Utf8 Latin2 string (contains code points in Latin2)
> 
> 
> Investigating further, I see that docs/xml_2.html shows the exact same
> example of converting Latin-1 to "Utf8 Latin2".

The UTF-X encodings can start with a BOM ("byte-order mark"). This changes the
behaviour of the encoding:

   ------------------------------
   -- Byte-order mark handling --
   ------------------------------

   type Bom_Type is
     (Utf8_All,  --  Utf8-encoding
      Utf16_LE,  --  Utf16 little-endian encoding
      Utf16_BE,  --  Utf16 big-endian encoding
      Utf32_LE,  --  Utf32 little-endian encoding
      Utf32_BE,  --  Utf32 big-endian encoding
      Ucs4_BE,   --  UCS-4, big endian machine (1234 order)
      Ucs4_LE,   --  UCS-4, little endian machine (4321 order)
      Ucs4_2143, --  UCS-4, unusual byte order (2143 order)
      Ucs4_3412, --  UCS-4, unusual byte order (3412 order)
      Unknown);  --  Unknown, assumed to be ASCII compatible
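
For illustration, a BOM can be recognised from the first few bytes of the
input. The sketch below is my own and does not use the XML/Ada routines.
Bom_Kind is a reduced, hypothetical version of the Bom_Type above (the UCS-4
variants are left out), and Starts_With and Detect are made-up names:

```ada
with Ada.Text_IO;

procedure Bom_Demo is
   --  Hypothetical, reduced enumeration; not the XML/Ada Bom_Type itself.
   type Bom_Kind is
     (Utf8_All, Utf16_LE, Utf16_BE, Utf32_LE, Utf32_BE, Unknown);

   type Byte_List is array (Positive range <>) of Natural;

   --  True if Data begins with the byte values listed in Bytes.
   function Starts_With (Data : String; Bytes : Byte_List) return Boolean is
   begin
      if Data'Length < Bytes'Length then
         return False;
      end if;
      for I in Bytes'Range loop
         if Character'Pos (Data (Data'First + I - Bytes'First)) /= Bytes (I)
         then
            return False;
         end if;
      end loop;
      return True;
   end Starts_With;

   function Detect (Data : String) return Bom_Kind is
   begin
      --  The UTF-32 marks are tested before the UTF-16 ones because
      --  FF FE 00 00 also starts with the UTF-16 little-endian mark.
      if Starts_With (Data, (16#EF#, 16#BB#, 16#BF#)) then
         return Utf8_All;
      elsif Starts_With (Data, (16#FF#, 16#FE#, 0, 0)) then
         return Utf32_LE;
      elsif Starts_With (Data, (0, 0, 16#FE#, 16#FF#)) then
         return Utf32_BE;
      elsif Starts_With (Data, (16#FF#, 16#FE#)) then
         return Utf16_LE;
      elsif Starts_With (Data, (16#FE#, 16#FF#)) then
         return Utf16_BE;
      else
         return Unknown;
      end if;
   end Detect;

   --  A UTF-8 BOM (EF BB BF) followed by some ASCII text.
   Sample : constant String :=
     Character'Val (16#EF#) & Character'Val (16#BB#)
     & Character'Val (16#BF#) & "<a/>";
begin
   Ada.Text_IO.Put_Line (Bom_Kind'Image (Detect (Sample)));
end Bom_Demo;
```

Input without any of these marks comes back as Unknown, which matches the
"assumed to be ASCII compatible" case in the declaration above.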

BTW: I am currently adding Wide_Character support to the XMLAda/Unicode
package.

With Regards

Martin

-- 
mailto://krischik@users.sourceforge.net
http://www.ada.krischik.com




