From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: unicode and wide_text_io
Date: Thu, 28 Dec 2017 10:04:41 +0100
Message-ID: <p22c38$1adn$1@gioia.aioe.org>
In-Reply-To: a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com
On 2017-12-27 23:32, Mehdi Saada wrote:
> Fundamentally, how can a UTF-8 string even represent code points beyond the 255th?
UTF-8 uses a chain code to represent large integers. ASCII 7-bit is
coded as-is. Other characters require more than one octet. It is a
technique widely used in communication for lossless compression. The
drawback is that you cannot directly index characters in a UTF-8
string. But virtually no text processing algorithm needs that. So it is
not actually a loss.
In short, representation unit (octet) /= represented thing (character).
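To make the chain code concrete, here is a minimal sketch in Python (not Ada, purely for illustration). The function name utf8_encode and the sample text are mine, not part of any library; the result is checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 octets (hand-rolled for illustration)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 7-bit ASCII: a single octet, coded as-is.
        return bytes([code_point])
    if code_point < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Sample text: x, superscript two, space, less-or-equal, space, pi.
text = "x\u00b2 \u2264 \u03c0"
octets = b"".join(utf8_encode(ord(ch)) for ch in text)

# Number of characters and number of octets differ.
char_count = len(text)
octet_count = len(octets)
```

Here the string has fewer characters than octets, which is exactly why indexing the octet array does not index characters.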
> Superscripts and subscripts mean more changes in the IO package.
> Before I could simply use the generic Integer_IO, but I have no clue
> how to output a specific code point for each digit in a
> specific base... wouldn't that mean rewriting part of Integer_IO?
You mean the standard library Integer_IO? Sure, you will have to replace it.
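The replacement does not have to be large. A rough sketch of the idea, again in Python rather than Ada: convert the number to digits in the given base and map each digit to its superscript code point. The names SUPERSCRIPT_DIGITS, SUPERSCRIPT_MINUS and to_superscript are hypothetical, and I assume bases 2..10 only, since Unicode has no uniform set of superscript letters for larger bases.

```python
# Superscript digits 0..9. Note that 1, 2 and 3 live in Latin-1 (U+00B9,
# U+00B2, U+00B3) while the rest are in the U+2070 block.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUPERSCRIPT_MINUS = "\u207b"


def to_superscript(value: int, base: int = 10) -> str:
    """Render an integer with superscript digits (assumption: base in 2..10)."""
    if not 2 <= base <= 10:
        raise ValueError("only bases 2..10 are supported in this sketch")
    if value == 0:
        return SUPERSCRIPT_DIGITS[0]
    digits = []
    remaining = abs(value)
    while remaining > 0:
        remaining, digit = divmod(remaining, base)
        digits.append(SUPERSCRIPT_DIGITS[digit])
    if value < 0:
        digits.append(SUPERSCRIPT_MINUS)
    # Digits were collected least significant first.
    return "".join(reversed(digits))
```

An Ada version would follow the same shape, producing a Wide_Wide_String or a UTF-8 encoded String instead.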
> I may have a rather very shallow understanding of characters
> encoding and representation, and that's quite an understatement, but
> you said: "Ada's Character has Latin-1 encoding which differs from
> UTF-8 in the code positions greater than 127"
> Really ??
Yep. Latin-1 and UTF-8 have different representations. Both have ASCII
7-bit as a subset.
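A small Python illustration of the difference (variable names are mine): pure ASCII text encodes identically in both, while a character at code position 233 takes one octet in Latin-1 and two in UTF-8.

```python
ascii_text = "Ada"
accented = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code position 233

# ASCII 7-bit is a common subset: both encodings give the same octets.
same_for_ascii = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# Above 127 the representations differ.
latin1_octets = accented.encode("latin-1")  # one octet
utf8_octets = accented.encode("utf-8")      # two octets

# Reading UTF-8 octets as if they were Latin-1 yields two wrong characters.
misread = utf8_octets.decode("latin-1")
```

The misread case is what you typically see when UTF-8 octets stored in an Ada String are printed as Latin-1.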
> You're saying that a position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard?
Character = Latin-1
Wide_Character = UCS-2
Wide_Wide_Character = UCS-4
Linux has used UTF-8 for a long time. Windows uses either an ANSI code
page (so-called A-calls) or UTF-16 (so-called W-calls). There was a time,
long ago, when Windows used UCS-2, but then they ditched it for UTF-16.
Now, Ada programmers insolently ignore the standard and pragmatically use:
Character = representation unit of UTF-8 (octet)
Wide_Character = representation unit of UTF-16
Wide_Wide_Character = UNICODE code point
This works most of the time, but one should be careful.
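Where care is needed, sketched in Python (the helper utf16_units is hypothetical, not a library function): a code point outside the Basic Multilingual Plane is one Wide_Wide_Character, but two UTF-16 units (a surrogate pair) and four UTF-8 octets. Counting representation units is then not counting characters.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units of text (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


pi = "\u03c0"          # inside the BMP: fits in a single UCS-2/UTF-16 unit
clef = "\U0001d11e"    # MUSICAL SYMBOL G CLEF, outside the BMP

pi_units = utf16_units(pi)
clef_units = utf16_units(clef)          # a surrogate pair
clef_utf8_len = len(clef.encode("utf-8"))
```

For the clef, treating each Wide_Character as a character would see two "characters", neither of which is a valid code point on its own.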
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de