Re: UTF-8 in strings - a bug?

comp.lang.ada

From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: UTF-8 in strings - a bug?
Date: Sat, 08 May 2004 11:06:40 GMT
Message-ID: <4b3nc.58514$mU6.237399@newsb.telia.net>
In-Reply-To: <pld65f8isk.fsf@sparre.crs4.it>

Jacob Sparre Andersen wrote:

> Your quotes (which may be unfair :-)

Sorry, I should have provided more context. Here's the relevant part of 
unicode/unicode.ads in XML/Ada version 1.0 from ACT-Europe, so you don't 
have to download the library just to see what I'm talking about:


--  Coded character sets  (packages Unicode.CCS.*)
--  ====================
--  Mapping from a set of abstract characters to the set of non-negative
--  integers
--  The integer associated with a character is called "code point", and the
--  character is called "encoded character"
--  Examples of these are:  ISO/8859-1, JIS X 0208, ...
--
--  Character naming (packages Unicode.Names.*)
--  ================
--  A unique name is assigned to each abstract character, so that it is
--  possible to get the same character no matter what repertoire is used.
--
--  Character Encoding Forms
--  ========================
--  Mapping from the set of integers used in a Coded Character Set to the set
--  of sequences of code units.
--  A "code unit" is integer occupying a specified binary width in a computer
--  architecture
--  Examples of fixed-width encoding forms:  7-bit, 8-bit, EBCDIC
--  Examples of variable-width encoding forms:  Utf-8, Utf-16,...
--
--  Character Encoding Scheme (packages Unicode.CES.*)
--  =========================
--  Mapping of code units into serialized byte sequences. It also takes into
--  account the byte-order serialization.

--  As a summary, converting a file containing latin-1 characters coded on
--  8 bits to a Utf8 latin2 file, the following steps are involved:
--
--     Latin1 string  (contains bytes associated with code points in Latin1)
--       |    "use Unicode.CES.Basic_8bit.To_Utf32"
--       v
--     Utf32 latin1 string (contains code points in Latin1)
--       |    "Convert argument to To_Utf32 should be
--       v         Unicode.CCS.Iso_8859_1.Convert"
--     Utf32 Unicode string (contains code points in Unicode)
--       |    "use Unicode.CES.Utf8.From_Utf32"
--       v
--     Utf8 Unicode string (contains code points in Unicode)
--       |    "Convert argument to From_Utf32 should be
--       v         Unicode.CCS.Iso_8859_2.Convert"
--     Utf8 Latin2 string (contains code points in Latin2)


Investigating further, I see that docs/xml_2.html shows the exact same 
example of converting Latin-1 to "Utf8 Latin2".
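
For comparison, here is a minimal sketch of the first three steps of that
summary (Latin-1 bytes, then code points, then UTF-8). It does not use
XML/Ada at all: it assumes an Ada 2012 compiler and relies only on the
standard packages Ada.Characters.Conversions and Ada.Strings.UTF_Encoding.
The procedure name Latin1_To_Utf8 and the sample string are made up for
illustration.

```ada
with Ada.Text_IO;
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

--  Hypothetical demo procedure; not part of XML/Ada.
procedure Latin1_To_Utf8 is

   --  "Bjorn" with o-diaeresis, written as Latin-1 bytes.
   --  16#F6# is the ISO 8859-1 code for o-diaeresis.
   Latin1 : constant String := "Bj" & Character'Val (16#F6#) & "rn";

   --  Latin-1 code points coincide with the first 256 Unicode code
   --  points, so widening each character yields Unicode code points.
   Code_Points : constant Wide_Wide_String :=
     Ada.Characters.Conversions.To_Wide_Wide_String (Latin1);

   --  Encode the code points as a sequence of UTF-8 code units.
   Utf8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Code_Points);

begin
   --  Print each UTF-8 code unit as a decimal number.
   --  The single byte 16#F6# becomes the two bytes 195 182 (C3 B6).
   for C of Utf8 loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
   end loop;
   Ada.Text_IO.New_Line;
end Latin1_To_Utf8;
```

The result is six code units for five characters, and every one of them
still denotes a Unicode code point.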

-- 
Björn Persson

jor ers @sv ge.
b n_p son eri nu
