From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: unicode and wide_text_io
Date: Thu, 28 Dec 2017 10:04:41 +0100
Organization: Aioe.org NNTP Server
Message-ID: <p22c38$1adn$1@gioia.aioe.org>
References: <ccd8e071-c228-4518-967e-09011cd5e291@googlegroups.com>
 <a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com>
NNTP-Posting-Host: TliDXSPe+gBSGCqP3SEJ2Q.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.2
X-Notice: Filtered by postfilter v. 0.8.2
Content-Language: en-US
Xref: reader02.eternal-september.org comp.lang.ada:49669
Date: 2017-12-28T10:04:41+01:00
List-Id: <comp.lang.ada>

On 2017-12-27 23:32, Mehdi Saada wrote:

> Fundamentaly, how can a UTF8 string even represent  codepoints next to the 255th ??

UTF-8 uses a chain code (a variable-length code) to represent large 
integers. 7-bit ASCII is coded as-is. All other characters require more 
than one octet. It is a technique widely used in communication for 
lossless compression. The drawback is that you cannot directly index 
characters in a UTF-8 string. But virtually no text processing 
algorithm needs that. So no real loss, actually.

In short, representation unit (octet) /= represented thing (character).
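
To make the chain code concrete, here is a minimal sketch in Python 
(not Ada, just for illustration). The function name utf8_encode is 
made up for this example; it follows the bit layout of UTF-8 and shows 
how code points beyond 255 end up as several octets.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 octets (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 7-bit ASCII is coded as-is: one octet.
        return bytes([code_point])
    if code_point < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For example, utf8_encode(0x20AC) (the euro sign) yields the three 
octets E2 82 AC, none of which is the character itself.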

> Superscripts and subscripts means more change in the IO package.
> Before I could simply use the generic Integer_IO, but I have no clue 
> how to do to output a specific code point for each digit in a
> specific  base... wouldn't that mean rewriting part of Integer_IO ?

You mean the standard library Integer_IO? Sure, you will have to replace it.
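
The replacement need not be large. A rough sketch of the idea in 
Python (for illustration only, the Ada version would follow the same 
steps): convert the number digit by digit and map each digit to its 
superscript code point. The names SUPERSCRIPT_DIGITS and 
to_superscript are invented here, and bases above 10 are left out 
because Unicode has no complete set of superscript letters.

```python
# Superscript digits 0..9. Note that 1, 2 and 3 live in the Latin-1
# range while the others are in the U+2070 block.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUPERSCRIPT_MINUS = "\u207b"


def to_superscript(value: int, base: int = 10) -> str:
    """Return value written in the given base using superscript digits."""
    if not 2 <= base <= 10:
        raise ValueError("this sketch only supports bases 2..10")
    if value == 0:
        return SUPERSCRIPT_DIGITS[0]
    negative = value < 0
    value = abs(value)
    digits = []
    while value > 0:
        value, digit = divmod(value, base)
        digits.append(SUPERSCRIPT_DIGITS[digit])
    if negative:
        digits.append(SUPERSCRIPT_MINUS)
    # Digits were produced least significant first.
    return "".join(reversed(digits))
```

The result is a string of code points; it still has to be encoded 
(e.g. as UTF-8) before it is written out.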

> I may have a rather very shallow understanding of characters
> encoding and representation, and that's quite an understatement, but
> you said: "Ada's Character has Latin-1 encoding which differs from
> UTF-8 in the  code positions greater than 127"
> Really ??

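Yep. Latin-1 and UTF-8 have different representations. Both have 7-bit 
ASCII as a subset, but Latin-1 stores code positions 128..255 in a 
single octet while UTF-8 uses two octets for each of them.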
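
A small illustration in Python (again only a sketch, the variable 
names are arbitrary): the same text gives identical octets for the 
ASCII part and different octets for the character above 127.

```python
text = "caf\u00e9"  # "cafe" with an acute accent on the e (U+00E9)

latin1_octets = text.encode("latin-1")  # one octet per character
utf8_octets = text.encode("utf-8")      # U+00E9 becomes two octets

# Reading UTF-8 octets as if they were Latin-1 gives the familiar garbage:
# each octet of the two-octet sequence is taken for a separate character.
misread = utf8_octets.decode("latin-1")
```

Here latin1_octets ends in the single octet E9, utf8_octets ends in 
C3 A9, and misread has five characters instead of four.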

> You're sayin' there position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard ??

Character = Latin-1
Wide_Character = UCS-2
Wide_Wide_Character = UCS-4

Linux has used UTF-8 for a long time. Windows uses either an ANSI code 
page (the so-called A-calls) or UTF-16 (the so-called W-calls). There 
was a time, long ago, when Windows used UCS-2, but then they ditched it 
for UTF-16.
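
The difference between UCS-2 and UTF-16 shows up only outside the 
first 65536 code points: UTF-16 represents those with two 16-bit units 
(a surrogate pair), which UCS-2 cannot do. A Python sketch, with a 
helper name (utf16_units) made up for the example:

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of text (illustrative helper)."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


bmp_units = utf16_units("\u20ac")          # euro sign: one unit
pair_units = utf16_units("\U0001F600")     # beyond U+FFFF: two units
```

bmp_units is [0x20AC], while pair_units is [0xD83D, 0xDE00]. Neither 
of the latter two values is a character on its own.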

Now, Ada programmers insolently ignore the standard and pragmatically use:

Character = representation unit of UTF-8 (octet)
Wide_Character = representation unit of UTF-16
Wide_Wide_Character = UNICODE code point

This works most of the time, but one should be careful: standard 
library routines such as Ada.Characters.Handling still treat Character 
as Latin-1.
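
One typical trap, sketched in Python with a hypothetical helper 
is_valid_utf8: when the string elements are octets, slicing counts 
octets, so a slice can end in the middle of a character.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if octets form a complete, well-formed UTF-8 string."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "na\u00efve"            # 5 characters, the third one is U+00EF
octets = word.encode("utf-8")  # 6 octets, U+00EF takes two of them

cut_inside = octets[:3]   # "na" plus only the first octet of U+00EF
cut_between = octets[:4]  # "na" plus the whole two-octet character
```

is_valid_utf8(cut_inside) is False and is_valid_utf8(cut_between) is 
True, so any slicing has to respect character boundaries.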

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de