From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: A few questions on parsing, sockets, UTF-8 strings
Date: Thu, 11 Aug 2016 23:10:31 +0200

On 2016-08-11 20:22, john@peppermind.com wrote:
> On Thursday, August 11, 2016 at 6:49:33 PM UTC+1, Dmitry A. Kazakov wrote:
>
> You're right, Ascii uses only 0...127 as code points. But I thought
> that Ada fixed strings hold one byte per character, meaning that I can
> store UTF-8 in it?

You can. Formally you should not, because RM 3.5.2 defines Character as
Latin-1, but in practice nobody cares.

> Sorry for being such a noob, but I still don't get it.
> If GNAT GPS is set to UTF-8 (-gnatW8 for gnatmake and source encoding
> in GPS preferences), doesn't that mean that if I enter a Unicode
> character into a fixed string literal (just String, not Wide_String or
> Wide_Wide_String) that the string will contain this character in the
> form of as many bytes as the Unicode code point requires?

Source encoding is not the encoding of program strings. You had better
not use non-ASCII literals if you want your sources to be portable. If
you need some Unicode code points, use explicit conversions. With this:

    http://dmitry-kazakov.de/ada/strings_edit.htm#7.1

    Left_Arrow : constant String := Strings_Edit.UTF8.Image (16#2190#);

> So if it's a two-byte UTF-8 code point, then the string will contain
> two bytes?

The numeric value of a code point can need more than 2 bytes; the range
is 0..16#10FFFF#. The length of the UTF-8 representation depends on the
code point value, and it can be longer than 2 bytes. The maximal length
of a UCS-2 character is 3 bytes in UTF-8; code points above 16#FFFF#
take 4 bytes.

> In that case, as long as I don't need to access single characters
> ever, could I stick with fixed strings?

Yes. But there is no problem accessing single code points either. UTF-8
was designed for easy forward and backward navigation. In the package
above, Get takes code points moving forward and Get_Backwards does it
backwards.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
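
For illustration only, here is a minimal sketch of the two points made above: the UTF-8 length depends on the code point value, and stepping backwards only requires skipping continuation octets. It assumes an Ada 2012 compiler and uses the standard Ada.Strings.UTF_Encoding packages, not the Strings_Edit package linked above. The names UTF8_Length_Demo, To_UTF8 and Start_Of, and the sample code points, are hypothetical and chosen just for this example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Length_Demo is
   use Ada.Strings.UTF_Encoding;

   type Code_List is array (Positive range <>) of Natural;

   --  Illustrative samples: one code point for each UTF-8 length.
   Samples : constant Code_List := (16#41#, 16#E9#, 16#2190#, 16#1F600#);

   --  Hypothetical helper: encode one code point as UTF-8 with the
   --  standard library, storing the octets in an ordinary String.
   function To_UTF8 (Code : Natural) return UTF_8_String is
   begin
      return Wide_Wide_Strings.Encode
               ((1 => Wide_Wide_Character'Val (Code)));
   end To_UTF8;

   --  Hypothetical helper: index of the first octet of the code point
   --  whose last octet is at Last.  Continuation octets have the form
   --  2#10xx_xxxx# (16#80# .. 16#BF#), so walk back past them.
   function Start_Of (Source : String; Last : Positive) return Positive is
      First : Positive := Last;
   begin
      while First > Source'First
        and then Character'Pos (Source (First)) in 16#80# .. 16#BF#
      loop
         First := First - 1;
      end loop;
      return First;
   end Start_Of;

   --  "A" (1 octet) followed by the left arrow 16#2190# (3 octets).
   Text : constant String := To_UTF8 (16#41#) & To_UTF8 (16#2190#);
begin
   --  Report how many octets each sample code point needs.
   for Code of Samples loop
      declare
         Encoded : constant UTF_8_String := To_UTF8 (Code);
      begin
         Ada.Text_IO.Put_Line
           ("Code point" & Natural'Image (Code) & " takes"
            & Natural'Image (Encoded'Length) & " octet(s)");
      end;
   end loop;

   --  Step backwards from the end of Text to the start of its last
   --  code point, reported as an offset from Text'First.
   Ada.Text_IO.Put_Line
     ("Last code point starts at offset"
      & Natural'Image (Start_Of (Text, Text'Last) - Text'First));
end UTF8_Length_Demo;
```

The four samples should report lengths of 1, 2, 3 and 4 octets respectively. The sketch does no validation of malformed input; a real program would more likely use a ready-made package such as the one linked above.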