From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: unicode and wide_text_io
Date: Wed, 27 Dec 2017 17:57:59 -0600
Message-ID: <p21c28$880$1@franka.jacob-sparre.dk>
In-Reply-To: a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com
"Mehdi Saada" <00120260a@gmail.com> wrote in message
news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
>> meant output of code points. That is a different beast. Convert a code
>> point to UTF-8 string and output that. E.g.
> Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string
> even represent
> codepoints next to the 255th ??
Easy: it uses a variable-width representation.
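As a rough illustration of "variable-width" (sketched in Python rather than Ada, purely because it is quick to try; `utf8_width` is a made-up helper name, not part of any library), the number of bytes UTF-8 spends on a code point grows with the code point's value:

```python
# Illustration only: uses Python's built-in str.encode, not an Ada API.
# The sample code points are arbitrary picks from each UTF-8 length class.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_width(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, its UTF-8 length, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s): {encoded.hex(' ')}")
```

Code points up to 127 take one byte (identical to ASCII); everything above that takes two to four bytes, which is how a string of 8-bit units can carry code points well past 255.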
> I may have a rather very shallow understanding of characters encoding and
> representation,
That's the problem. Unless you can stick to Latin-1, you'll need to fix that
understanding before continuing.
In Ada, type Character = Latin-1 = first 256 code positions, 8-bit
representation. Text_IO and type String are for Latin-1 strings.
type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code
positions = UCS-2 = 16-bit representation.
type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.
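The three ranges can be sketched as a small lookup (again in Python, only as an illustration; `narrowest_ada_type` is a hypothetical helper, and the upper bound 16#10FFFF# is the Unicode limit rather than anything Ada-specific):

```python
def narrowest_ada_type(code_point: int) -> str:
    """Name the narrowest Ada character type whose range covers code_point.

    Hypothetical helper for illustration; it only compares against the
    code-position ranges of the three standard character types.
    """
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point <= 0xFF:        # Latin-1, 8 bits
        return "Character"
    if code_point <= 0xFFFF:      # BMP / UCS-2, 16 bits
        return "Wide_Character"
    if code_point <= 0x10FFFF:    # all of Unicode / UCS-4, 32 bits
        return "Wide_Wide_Character"
    raise ValueError("beyond the Unicode code space")


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}: {narrowest_ada_type(cp)}")
```

Note that in all three types the position number is the Unicode code point itself; only the width of the storage unit changes.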
There is no native support in Ada for UTF-8 or UTF-16 strings. There is a
conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it breaks
strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and
Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1
(there is no good way to tell between them in the general case).
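To see why the two cannot be told apart in general, here is a small sketch (Python, illustration only; `looks_like_utf8` is a hypothetical helper and at best a heuristic). The same bytes decode without error under both encodings and simply yield different text:

```python
# The two bytes C3 A9 are one character in UTF-8 but two in Latin-1.
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # 4 characters, ending in U+00E9
as_latin1 = raw.decode("latin-1")  # 5 characters, ending in U+00C3 U+00A9


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic only: True if data happens to be well-formed UTF-8.

    Every byte string is valid Latin-1, so a True result does not prove
    the data was written as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(len(as_utf8), len(as_latin1))
print(looks_like_utf8(raw), looks_like_utf8(b"caf\xe9"))
```

A failed UTF-8 decode is good evidence of Latin-1 (or some other 8-bit set), but a successful one proves nothing, which is why the reader has to know the encoding in advance.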
Windows uses a BOM character at the start of UTF-8 files to differentiate
(at least in programs like Notepad and the built-in edit control), but that
is not recommended by Unicode. I think they would prefer a world where
Latin-1 had disappeared completely, but that of course is not the real
world.
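For reference, the UTF-8 BOM is the three bytes EF BB BF (the encoding of U+FEFF). A minimal sketch of checking for it, in Python for illustration only (`strip_utf8_bom` is a hypothetical helper name; `codecs.BOM_UTF8` is the standard-library constant):

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_bom) for a byte string that may start with a BOM.

    A missing BOM says nothing about the encoding: the data may still be
    UTF-8, Latin-1, or something else entirely.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


payload, had_bom = strip_utf8_bom(b"\xef\xbb\xbfhello")
print(payload, had_bom)
```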
That's probably enough character set info to get you into trouble. ;-)
Randy.
> and that's quite an understatement, but you said: "Ada's Character has
> Latin-1 encoding which differs from UTF-8 in the code positions greater than
> 127"
> Really ?? You're sayin' there position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard ??
> And I know peanuts about the UCS-2 thing. I'm too ignorant for getting one
> bit of your saying, except it sounds like heresy in the ears of the Ada
> Church. Burn them all !!
> Ada.stream permits output of bits without any formatting, right ? If so,
> it might do.