From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: unicode and wide_text_io
Date: Thu, 28 Dec 2017 10:04:41 +0100

On 2017-12-27 23:32, Mehdi Saada wrote:

> Fundamentaly, how can a UTF8 string even represent codepoints next to the 255th ??

UTF-8 uses a chain code to represent large integers. 7-bit ASCII is coded as-is. Other characters require more than one octet. It is a technique widely used in communication for lossless compression.

The drawback is that you cannot directly index characters in a UTF-8 string. But virtually no text processing algorithm needs that, so it is no actual loss.

In short, representation unit (octet) /= represented thing (character).

> Superscripts and subscripts means more change in the IO package.
> Before I could simply use the generic Integer_IO, but I have no clue
> how to do to output a specific code point for each digit in a
> specific base... wouldn't that mean rewriting part of Integer_IO ?

You mean the standard library Integer_IO? Sure, you will have to replace it.
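The replacement for the digit-output part need not be large. Below is a minimal sketch, written in Python rather than Ada purely to show the idea: split the value into digits in the chosen base and map each digit to its superscript code point. The names SUPERSCRIPT_DIGITS and to_superscript are made up for illustration. Since Unicode has dedicated superscript digits only for 0 to 9, the sketch assumes a base between 2 and 10 and a non-negative value.

```python
# Superscript forms of the digits 0..9, indexed by digit value.
# Only 1, 2 and 3 (U+00B9, U+00B2, U+00B3) lie in the Latin-1 range;
# the rest are in U+2070..U+2079, so an 8-bit Latin-1 character type
# cannot hold all of them.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"


def to_superscript(value: int, base: int = 10) -> str:
    """Render a non-negative integer in the given base using superscript digits."""
    if not 2 <= base <= 10:
        raise ValueError("only bases 2..10 have superscript digits")
    if value < 0:
        raise ValueError("negative values are not handled in this sketch")
    if value == 0:
        return SUPERSCRIPT_DIGITS[0]
    digits = []
    while value > 0:
        # Peel off the least significant digit first.
        value, remainder = divmod(value, base)
        digits.append(SUPERSCRIPT_DIGITS[remainder])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))
```

For example, to_superscript(42) yields the two code points U+2074 U+00B2, which take five octets when written out as UTF-8. In Ada the same mapping would presumably be done over Wide_Wide_Character and encoded on output.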
> I may have a rather very shallow understanding of characters
> encoding and representation, and that's quite an understatement, but
> you said: "Ada's Character has Latin-1 encoding which differs from
> UTF-8 in the code positions greater than 127"
> Really ??

Yep. Latin-1 and UTF-8 have different representations. Both have 7-bit ASCII as a subset.

> You're sayin' there position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard ??

By the standard:

   Character           = Latin-1
   Wide_Character      = UCS-2
   Wide_Wide_Character = UCS-4

Linux has used UTF-8 for a long time. Windows uses either an ANSI code page (so-called A-calls) or UTF-16 (so-called W-calls). There was a time, long ago, when Windows used UCS-2, but then they ditched it for UTF-16.

Now, Ada programmers insolently ignore the standard and pragmatically use:

   Character           = representation unit of UTF-8 (octet)
   Wide_Character      = representation unit of UTF-16
   Wide_Wide_Character = UNICODE code point

This works most of the time, but one should be careful.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
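For reference, the chain code described at the top of this post fits in a few lines. This is a sketch in Python, not Ada, and utf8_encode is a hypothetical name chosen for illustration. It assumes the input is a valid Unicode scalar value (0 to 16#10FFFF#, surrogates excluded).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as a sequence of UTF-8 octets."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 7-bit ASCII is coded as-is: one octet, high bit clear.
        return bytes([code_point])
    if code_point < 0x800:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Code point 16#E9# (e with acute accent) shows the difference from Latin-1: Latin-1 stores it as the single octet 16#E9#, while utf8_encode(0xE9) gives the two octets 16#C3# 16#A9#. Below 128 the two encodings agree, and above 255 Latin-1 has no representation at all.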