From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=unavailable autolearn_force=no version=3.4.4
X-Received: by 10.36.65.132 with SMTP id b4mr22236631itd.55.1514438451648;
        Wed, 27 Dec 2017 21:20:51 -0800 (PST)
X-Received: by 10.157.64.68 with SMTP id o4mr1006077oti.9.1514438451536; Wed,
 27 Dec 2017 21:20:51 -0800 (PST)
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!paganini.bofh.team!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!i6no3226874itb.0!news-out.google.com!b73ni12212ita.0!nntp.google.com!g80no3220035itg.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Wed, 27 Dec 2017 21:20:51 -0800 (PST)
In-Reply-To: <p21c28$880$1@franka.jacob-sparre.dk>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com;
 posting-host=2601:191:8303:2100:5985:2c17:9409:aa9c;
 posting-account=fdRd8woAAADTIlxCu9FgvDrUK4wPzvy3
NNTP-Posting-Host: 2601:191:8303:2100:5985:2c17:9409:aa9c
References: <ccd8e071-c228-4518-967e-09011cd5e291@googlegroups.com>
 <a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com>
 <p21c28$880$1@franka.jacob-sparre.dk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <9e0a433c-2c52-4118-8624-dd7c23496074@googlegroups.com>
Subject: Re: unicode and wide_text_io
From: Robert Eachus <rieachus@comcast.net>
Injection-Date: Thu, 28 Dec 2017 05:20:51 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Xref: reader02.eternal-september.org comp.lang.ada:49668
Date: 2017-12-27T21:20:51-08:00
List-Id: <comp.lang.ada>

On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
> "Mehdi Saada" <00120260a@gmail.com> wrote in message
> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
> >> meant output of code points. That is a different beast. Convert a code
> >> point to UTF-8 string and output that. E.g.
> > Sure I'll look to your work, but ... Fundamentally, how can a UTF8 string
> > even represent
> > codepoints next to the 255th ??
>
> Easy: it uses a variable-width representation.
>=20
> > I may have a rather very shallow understanding of characters encoding and
> > representation,
>
> That's the problem. Unless you can stick to Latin-1, you'll need to fix that
> understanding before continuing.
>
> In Ada, type Character = Latin-1 = first 256 code positions, 8-bit
> representation. Text_IO and type String are for Latin-1 strings.
>
> type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code
> positions = UCS-2 = 16-bit representation.

There is also UTF-16, which is identical to Unicode; values in the range
D800 to DFFF (hex) are used as escapes (surrogates) to allow more than
65536 code points.
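
As a rough sketch of that escape mechanism (in Python rather than Ada, purely for brevity; the function name to_utf16_units is made up for illustration), a code point above FFFF is split into two 16-bit values drawn from the D800..DFFF range:

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit UTF-16 code units for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are reserved as escapes")
    if code_point < 0x10000:
        # Inside the BMP: one unit, same value as in UCS-2.
        return [code_point]
    # Outside the BMP: 20 bits of offset split across two escape units.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        units = " ".join(f"{u:04X}" for u in to_utf16_units(cp))
        print(f"U+{cp:04X} -> {units}")
```

For anything inside the BMP the result is a single unit with the same value as in UCS-2, which is why the two are indistinguishable until the extra planes are used.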
>
> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

No, all of UCS-4, everything defined in ISO-10646.
>
> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a
> conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it
> breaks strong typing] which lets one use UTF-8 and UTF-16 strings with
> Text_IO and Wide_Text_IO. But you have to know if you are reading in UTF-8
> or Latin-1 (there is no good way to tell between them in the general case).
>
> Windows uses a BOM character at the start of UTF-8 files to differentiate
> (at least in programs like Notepad and the built-in edit control), but that
> is not recommended by Unicode. I think they would prefer a world where
> Latin-1 had disappeared completely, but that of course is not the real
> world.
>
> That's probably enough character set info to get you into trouble. ;-)

Mild trouble anyway, no burnings, no heresy trials. The ISO-10646 standard
does favor using the correct BOM at the start of UTF-8, UCS-2 and UCS-4.
Unicode is an extended version of UCS-2 that includes pages other than the
10646 BMP (Basic Multilingual Plane). Using a BOM with Unicode may mislead a
program reading the file. The problem is not telling Unicode from UCS-2 when
they are different. There are no differences between Unicode and UCS-2
unless those extra pages are used, so files in most languages will be
identical. Even Japanese and Chinese may not be detectable--unless you omit
the BOM for Unicode files. ;-)
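
To make the BOM point concrete, here is a minimal sketch (Python, not Ada; the names BOMS and sniff_bom are invented for this example) of what a reader can and cannot learn from the first few bytes of a file. Note that the 16-bit signatures are the same whether the rest of the file is plain UCS-2 or uses the escape pairs:

```python
# Longer signatures are listed first, because the little-endian 32-bit BOM
# begins with the same two bytes as the little-endian 16-bit BOM.
BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 / UTF-32, big-endian"),
    (b"\xff\xfe\x00\x00", "UCS-4 / UTF-32, little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UCS-2 or UTF-16, big-endian"),
    (b"\xff\xfe", "UCS-2 or UTF-16, little-endian"),
]


def sniff_bom(data: bytes):
    """Return a description of the BOM at the start of data, or None."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None  # No BOM: could be Latin-1, UTF-8, or anything else.


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbf" + "caf\u00e9".encode("utf-8")))
    print(sniff_bom(b"\xff\xfe" + "caf\u00e9".encode("utf-16-le")))
    print(sniff_bom("caf\u00e9".encode("latin-1")))
```

This is only a guess based on the signature; a 16-bit little-endian file whose first character is NUL would be misread as 32-bit, so a real reader would need more than this.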

> > Really ?? You're sayin' that a position such as Wide_Character'Val(X)
> > doesn't correspond to the Xth character in the UNICODE standard ??

Whoo boy, digging a deep hole here. You have to keep in mind that there are
at least three character sets that matter when you are programming in Ada
(or any other language).

First, there is the character set that you use to create the program. The
Ada standard provides a default, and it is the one that the compiler tests
use. But it is only a default, and GNAT accepts source in different formats.
Back when Ada was new, there were compilers for programs written in IBM's
EBCDIC.

The second character set (or set of them) you care about is the Ada
Character type, along with the other character types. In the IBM compiler
above, Character corresponded to ASCII as expected: the ordering of
character literals was ASCII, not EBCDIC, and so on.
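
A small illustration of why that distinction matters (a Python sketch; cp037 is just one EBCDIC code page picked for the example, not necessarily the one that compiler's source files used). The same three characters sort differently depending on which code table supplies the ordering:

```python
def sort_by_codec(chars, codec):
    """Sort characters by their byte value in the given encoding."""
    return sorted(chars, key=lambda c: c.encode(codec))


sample = ["a", "A", "0"]

# ASCII places digits first, then upper case, then lower case.
ascii_order = sort_by_codec(sample, "ascii")

# EBCDIC (code page 037 here) places lower case first and digits last.
ebcdic_order = sort_by_codec(sample, "cp037")

if __name__ == "__main__":
    print("ASCII :", ascii_order)    # ['0', 'A', 'a']
    print("EBCDIC:", ebcdic_order)   # ['a', 'A', '0']
```

An Ada program comparing Character values sees the first ordering no matter how the source file itself was encoded.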

The third group of character sets are those that correspond to printers,
displays and keyboards. If you need to write code that supports, say,
Cyrillic terminals, you may end up with strings that are really in, say,
Russian. Best to gather them all in one "Language" package, to make it
easier when you have to do Ukrainian. :-(
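
The idea of one "Language" package can be sketched as follows (Python standing in for an Ada package; every name here is hypothetical, and the KOI8 code pages are an assumption about what such terminals might expect). All user-visible strings and the device encoding sit in one table, so adding a language touches one place:

```python
# Hypothetical catalog: one entry per language, holding the device
# encoding (assumed KOI8 variants) and the user-visible messages.
LANGUAGES = {
    "ru": {
        "codec": "koi8_r",
        "messages": {"file_not_found": "Файл не найден"},
    },
    "uk": {
        "codec": "koi8_u",
        "messages": {"file_not_found": "Файл не знайдено"},
    },
}


def render(language: str, key: str) -> bytes:
    """Return the message as bytes in the terminal's own character set."""
    entry = LANGUAGES[language]
    return entry["messages"][key].encode(entry["codec"])


if __name__ == "__main__":
    for lang in ("ru", "uk"):
        print(lang, render(lang, "file_not_found").hex(" "))
```

The rest of the program only ever asks for a message by key, and never needs to know which character set the display uses.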

If all three character sets are the same, that's nice. But it can lead to
sloppy thinking. Way back when the ARG was wrestling with this, getting
everyone on the same page about which set of character sets we were
discussing at the moment allowed us to get things into reasonable shape
going into the Ada 9X development. You want your compiler to allow Shift-JIS
in comments? Sure. Just remember that an end of line, and only an end of
line, terminates a comment.
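
Why this rule is safe for Shift-JIS can be checked with a short sketch (Python; the function name and the choice of bytes 0A..0D as line terminators are assumptions made for the illustration). The second byte of a two-byte Shift-JIS character can collide with printable ASCII such as the backslash, but it never falls in the control range where line terminators live:

```python
def comment_bytes_report(comment: str, codec: str = "shift_jis") -> dict:
    """Report which troublesome byte values appear in an encoded comment."""
    data = comment.encode(codec)
    return {
        # Bytes 0x0A..0x0D (LF, VT, FF, CR) are treated as line ends here.
        "has_line_terminator": any(0x0A <= b <= 0x0D for b in data),
        # 0x5C is the ASCII backslash; it can occur as a trail byte.
        "has_backslash_byte": 0x5C in data,
    }


if __name__ == "__main__":
    # Katakana for "source code"; the first character encodes as 83 5C.
    text = "ソースコード"
    print(text.encode("shift_jis").hex(" "))
    print(comment_bytes_report(text))
```

A scanner that skips everything up to the end of line never has to understand the two-byte characters at all, whereas one that gave meaning to other bytes inside a comment would trip over trail bytes like 5C.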