From: Robert Eachus
Newsgroups: comp.lang.ada
Subject: Re: unicode and wide_text_io
Date: Wed, 27 Dec 2017 21:20:51 -0800 (PST)
Message-ID: <9e0a433c-2c52-4118-8624-dd7c23496074@googlegroups.com>

On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
> "Mehdi Saada" <00120260a@gmail.com> wrote in message
> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
> >> meant output of code points. That is a different beast.
> >> Convert a code point to UTF-8 string and output that. E.g.
> > Sure I'll look to your work, but ... Fundamentally, how can a UTF8 string
> > even represent
> > codepoints next to the 255th ??
>
> Easy: it uses a variable-width representation.
>
> > I may have a rather very shallow understanding of characters encoding and
> > representation,
>
> That's the problem. Unless you can stick to Latin-1, you'll need to fix that
> understanding before continuing.
>
> In Ada, type Character = Latin-1 = first 256 code positions, 8-bit
> representation. Text_IO and type String are for Latin-1 strings.
>
> type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code
> positions = UCS-2 = 16-bit representation.

There is also UTF-16, the 16-bit encoding of Unicode, in which values in the
range 16#D800# to 16#DFFF# are used as escapes (surrogates) to allow more
than 65536 code points.

>
> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

No, all of UCS-4, everything defined in ISO-10646.

>
> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a
> conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it
> breaks strong typing] which lets one use UTF-8 and UTF-16 strings with
> Text_IO and Wide_Text_IO. But you have to know if you are reading in UTF-8
> or Latin-1 (there is no good way to tell between them in the general case).
>
> Windows uses a BOM character at the start of UTF-8 files to differentiate
> (at least in programs like Notepad and the built-in edit control), but that
> is not recommended by Unicode. I think they would prefer a world where
> Latin-1 had disappeared completely, but that of course is not the real
> world.
>
> That's probably enough character set info to get you into trouble. ;-)

Mild trouble anyway, no burnings, no heresy trials.
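
To make the "variable-width representation" answer concrete, here is a minimal sketch of how one code point becomes one to four UTF-8 bytes, and how UTF-16 uses the D800..DFFF range as escapes. It is written in Python only because that is easy to run; it is not Ada and not anything from the Ada library, and the function names utf8_encode and utf16_units are made up for illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                      # 7 bits: one byte, same as ASCII
        return bytes([cp])
    if cp < 0x800:                     # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # up to 16 bits: three bytes (rest of the BMP)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # up to 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point (hypothetical helper)."""
    if cp < 0x10000:                   # inside the BMP: one unit, same value as UCS-2
        return [cp]
    v = cp - 0x10000                   # 20 bits remain, split 10 + 10
    return [0xD800 | (v >> 10),        # high surrogate
            0xDC00 | (v & 0x3FF)]      # low surrogate


# 16#E9# (e-acute) fits in Latin-1 but still takes two bytes in UTF-8.
print(utf8_encode(0xE9).hex())                      # c3a9
# 16#20AC# (euro sign) is past position 255 and takes three bytes.
print(utf8_encode(0x20AC).hex())                    # e282ac
# 16#1F600# is outside the BMP: four UTF-8 bytes, two UTF-16 units.
print(utf8_encode(0x1F600).hex())                   # f09f9880
print([hex(u) for u in utf16_units(0x1F600)])       # ['0xd83d', '0xde00']
```

So a String holding UTF-8 is just a sequence of 8-bit values in which some characters occupy several positions, which is why indexing it by character position no longer works.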
The ISO-10646 standard does favor using the correct BOM at the start of
UTF-8, UCS-2 and UCS-4. Unicode is an extended version of UCS-2 to include
pages other than the 10646 BMP (Basic Multilingual Plane). Using a BOM with
Unicode may mislead a program reading the file. The problem is not telling
Unicode from UCS-2 when they are different. There are no differences between
Unicode and UCS-2 unless those extra pages are used. Files in most languages
will be identical. Even Japanese and Chinese may not be detectable--unless
you omit the BOM for Unicode files. ;-)

> > Really ?? You're sayin' there position such as Wide_Character'Val(X)
> > doesn't correspond to the Xth character in the UNICODE standard ??

Whoo boy, digging a deep hole here. You have to keep in mind that there are
at least three character sets that matter when you are programming in Ada
(or any other language).

First, there is the character set that you use to create the program. The
Ada standard provides a default, and it is the one that the compiler tests
use. But it is only a default, and GNAT accepts source in different formats.
Back when Ada was new, there were compilers for programs written in IBM's
EBCDIC.

The second character set you care about (or set of them) is the Ada
Character type and the other character types. In the IBM compiler above,
Character corresponded to ASCII as expected. The ordering of character
literals was ASCII, not EBCDIC, etc.

The third group of character sets are those that correspond to printers,
displays and keyboards. If you need to write code that supports, say,
Cyrillic terminals, you may end up with strings that are really in, say,
Russian. Best to gather them all in one "Language" package, to make it
easier when you have to do Ukrainian. :-(

If all three character sets are the same, that's nice. But it can lead to
sloppy thinking.
Way back when the ARG was wrestling with this, getting everyone on the same
page about which set of character sets we were discussing at any given
moment allowed us to get things into reasonable shape going into the Ada 9X
development. You want your compiler to allow Shift-JIS in comments? Sure.
Just remember that an end of line, and only an end of line, terminates a
comment.
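
As a footnote to the BOM discussion earlier in this thread, here is a rough sketch of what "look for a BOM, otherwise guess" amounts to for the third kind of character set (bytes coming from files and devices). It is Python rather than Ada, the names sniff_bom and decode_guess are invented for the example, and the fallback rule (try UTF-8, else assume Latin-1) is only one possible heuristic, not something any standard prescribes.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the order of this table matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_guess(data: bytes) -> str:
    """Decode using the BOM if there is one; otherwise guess (assumed heuristic)."""
    name, skip = sniff_bom(data)
    if name is not None:
        return data[skip:].decode(name)
    try:
        # Bytes that are not well-formed UTF-8 raise an error here.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid Latin-1, so this never fails.
        return data.decode("latin-1")


print(sniff_bom(b"\xef\xbb\xbfabc"))      # ('utf-8', 3)
print(sniff_bom(b"abc"))                  # (None, 0)
print(decode_guess(b"caf\xe9"))           # Latin-1 fallback: café
```

The weak spot is the one Randy pointed out: the two bytes C3 A9 are well-formed UTF-8 for one character and also a perfectly legal two-character Latin-1 string, so without a BOM the guess can be wrong and nothing in the data will tell you.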