From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: unicode and wide_text_io
Date: Wed, 27 Dec 2017 17:57:59 -0600
Organization: JSA Research & Innovation

"Mehdi Saada" <00120260a@gmail.com> wrote in message
news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
>> meant output of code points. That is a different beast. Convert a code
>> point to UTF-8 string and output that. E.g.
> Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string
> even represent
> codepoints next to the 255th ??

Easy: it uses a variable-width representation.

> I may have a rather very shallow understanding of characters encoding and
> representation,

That's the problem. Unless you can stick to Latin-1, you'll need to fix that
understanding before continuing.

In Ada,
type Character = Latin-1 = first 256 code positions = 8-bit representation.
Text_IO and type String are for Latin-1 strings.
type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code
positions = UCS-2 = 16-bit representation.
type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

There is no native support in Ada for UTF-8 or UTF-16 strings. There is a
conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it
breaks strong typing] which lets one use UTF-8 and UTF-16 strings with
Text_IO and Wide_Text_IO. But you have to know if you are reading in UTF-8 or
Latin-1 (there is no good way to tell between them in the general case).
Windows uses a BOM character at the start of UTF-8 files to differentiate (at
least in programs like Notepad and the built-in edit control), but that is
not recommended by Unicode. I think they would prefer a world where Latin-1
had disappeared completely, but that of course is not the real world.

That's probably enough character set info to get you into trouble. ;-)

Randy.

> and that's quite an understatement, but you said: "Ada's Character has
> Latin-1 encoding which differs from UTF-8 in the code positions greater
> than 127"
> Really ?? You're sayin' there position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard ??
> And I know peanuts about the UCS-2 thing. I'm too ignorant for getting one
> bit of your saying, except it sounds like heresy in the ears of the Ada
> Church. Burn them all !!
> Ada.stream permits output of bits without any formatting, right ? If so,
> it might do.
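
As a rough illustration of the points above (not part of the original post), the sketch below shows what "variable-width" means in practice: code positions up to 127 take one byte in UTF-8, and everything above takes two to four, which is also why Latin-1 and UTF-8 disagree above position 127. Python is used here only as neutral notation, and the helper names `describe` and `has_utf8_bom` are made up for this example.

```python
import codecs


def describe(ch):
    """Return (code position, UTF-8 bytes) for a single character."""
    return ord(ch), ch.encode("utf-8")


# One character from each UTF-8 length class:
# 'A' (U+0041), e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    code_point, utf8 = describe(ch)
    print(f"U+{code_point:04X} -> {len(utf8)} byte(s): {utf8.hex(' ')}")

# Latin-1 stores position 233 (e-acute) as one byte whose value equals
# the code position; UTF-8 needs two bytes for the same character.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")


def has_utf8_bom(data):
    """True if data starts with the UTF-8 encoding of U+FEFF (the BOM)."""
    return data.startswith(codecs.BOM_UTF8)
```

A BOM check like `has_utf8_bom` only recognizes files that were written with the marker. A file without one may still be UTF-8, which is the "no good way to tell" problem mentioned above. On the Ada side, the corresponding conversions live in the children of Ada.Strings.UTF_Encoding (for example `Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode`).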