From: "Björn Persson" <spam-away@nowhere.nil>
Subject: Re: UTF-8 in strings - a bug?
Date: Sat, 08 May 2004 11:06:40 GMT
Message-ID: <4b3nc.58514$mU6.237399@newsb.telia.net>
In-Reply-To: <pld65f8isk.fsf@sparre.crs4.it>
Jacob Sparre Andersen wrote:
> Your quotes (which may be unfair :-)
Sorry, I should have provided more context. Here's the relevant part of
unicode/unicode.ads in XML/Ada version 1.0 from ACT-Europe, so you don't
have to download the library just to see what I'm talking about:
-- Coded character sets (packages Unicode.CCS.*)
-- ====================
-- Mapping from a set of abstract characters to the set of non-negative
-- integers
-- The integer associated with a character is called "code point", and the
-- character is called "encoded character"
-- Examples of these are: ISO/8859-1, JIS X 0208, ...
--
-- Character naming (packages Unicode.Names.*)
-- ================
-- A unique name is assigned to each abstract character, so that it is
-- possible to get the same character no matter what repertoire is used.
--
-- Character Encoding Forms
-- ========================
-- Mapping from the set of integers used in a Coded Character Set to the set
-- of sequences of code units.
-- A "code unit" is integer occupying a specified binary width in a computer
-- architecture
-- Examples of fixed-width encoding forms: 7-bit, 8-bit, EBCDIC
-- Examples of variable-width encoding forms: Utf-8, Utf-16,...
--
-- Character Encoding Scheme (packages Unicode.CES.*)
-- =========================
-- Mapping of code units into serialized byte sequences. It also takes into
-- account the byte-order serialization.
-- As a summary, converting a file containing latin-1 characters coded on
-- 8 bits to a Utf8 latin2 file, the following steps are involved:
--
-- Latin1 string (contains bytes associated with code points in Latin1)
-- | "use Unicode.CES.Basic_8bit.To_Utf32"
-- v
-- Utf32 latin1 string (contains code points in Latin1)
-- | "Convert argument to To_Utf32 should be
-- v Unicode.CCS.Iso_8859_1.Convert"
-- Utf32 Unicode string (contains code points in Unicode)
-- | "use Unicode.CES.Utf8.From_Utf32"
-- v
-- Utf8 Unicode string (contains code points in Unicode)
-- | "Convert argument to From_Utf32 should be
-- v Unicode.CCS.Iso_8859_2.Convert"
-- Utf8 Latin2 string (contains code points in Latin2)
Investigating further, I see that docs/xml_2.html shows the exact same
example of converting Latin-1 to "Utf8 Latin2".
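
To make the steps in that diagram concrete, here is a rough sketch in
Python rather than Ada. It does not use XML/Ada at all; the function names
are made up for illustration, and the last step is only one possible
reading of what "Utf8 Latin2" could mean (Latin-2 code points pushed
through the UTF-8 encoding form).

```python
# Hypothetical helper names; none of these come from XML/Ada.

def latin1_bytes_to_code_points(data):
    # Encoding scheme step: each byte is one 8-bit code unit, and in
    # Latin-1 each code unit is directly a code point.
    return list(data)

def latin1_to_unicode(code_points):
    # Character set step: map Latin-1 code points to Unicode code points.
    # For Latin-1 this happens to be the identity mapping.
    return [ord(bytes([cp]).decode("iso-8859-1")) for cp in code_points]

def unicode_to_latin2(code_points):
    # Character set step in the other direction: Unicode -> Latin-2.
    return [chr(cp).encode("iso-8859-2")[0] for cp in code_points]

def utf8_encode_code_points(code_points):
    # Encoding form step: apply the UTF-8 bit layout to whatever
    # integers it is given, treating them as code points.
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

# Latin-1 bytes -> Unicode code points -> ordinary UTF-8
cps = latin1_to_unicode(latin1_bytes_to_code_points(b"caf\xe9"))
utf8_unicode = utf8_encode_code_points(cps)

# The questionable last step: Latin-2 code points fed to the UTF-8 encoder.
s_caron = [0x0160]                      # LATIN CAPITAL LETTER S WITH CARON
proper = utf8_encode_code_points(s_caron)
utf8_latin2 = utf8_encode_code_points(unicode_to_latin2(s_caron))
misread = utf8_latin2.decode("utf-8")   # what an ordinary UTF-8 reader sees
```

Under that reading, the "Utf8 Latin2" bytes differ from the real UTF-8
bytes for the same character, and a normal UTF-8 decoder would turn them
into a different character altogether.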
--
Björn Persson
jor ers @sv ge.
b n_p son eri nu