From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Newsgroups: comp.lang.ada
Subject: Re: A few questions on parsing, sockets, UTF-8 strings
Date: Thu, 11 Aug 2016 23:10:31 +0200
Organization: Aioe.org NNTP Server
Message-ID: <noipkf$tni$1@gioia.aioe.org>
References: <267bd80f-b388-4df6-b712-315ee9bda2b8@googlegroups.com>
 <noi8r3$2p6$1@gioia.aioe.org>
 <90caee48-5fa7-47d7-aad5-761e11225e2c@googlegroups.com>
 <noidr9$atk$1@gioia.aioe.org>
 <4c6509a9-5ff2-4f94-b2c3-55d89ca2b076@googlegroups.com>
NNTP-Posting-Host: xelDFTENDI+dlkJFd2Ot2w.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.2.0
X-Notice: Filtered by postfilter v. 0.8.2
Xref: news.eternal-september.org comp.lang.ada:31405
Date: 2016-08-11T23:10:31+02:00
List-Id: <comp.lang.ada>

On 2016-08-11 20:22, john@peppermind.com wrote:
> On Thursday, August 11, 2016 at 6:49:33 PM UTC+1, Dmitry A. Kazakov wrote:
>
> You're right, Ascii uses only 0...127 as code points. But I thought
> that Ada fixed strings hold one byte per character, meaning that I can
> store UTF-8 in it?

You can. Formally you should not, because RM 3.5.2 defines Character as 
Latin-1, but in practice nobody cares.

> Sorry for being such a noob, but I still don't get it. If GNAT GPS
> is  set to UTF-8 (-gnatW8 for gnatmake and source encoding in GPS
> preferences), doesn't that mean that if I enter a Unicode character into
> a fixed string literal (just String, not Wide_String or
> Wide_Wide_String) that the string will contain this character in the
> form of as many bytes as the Unicode code point requires?

Source encoding is not the encoding of program strings. You had better
not use non-ASCII literals if you want your sources to be portable.

If you need specific Unicode code points, use explicit conversions, for
example with this package:

    http://dmitry-kazakov.de/ada/strings_edit.htm#7.1

    Left_Arrow : constant String := Strings_Edit.UTF8.Image (16#2190#);
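
If you would rather not depend on an extra library, roughly the same can
be sketched with the Ada 2012 standard package Ada.Strings.UTF_Encoding.
This is only an illustration, assuming an Ada 2012 compiler; the
procedure name Arrow_Demo is made up for the example:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Arrow_Demo is
   --  U+2190 LEFTWARDS ARROW, built from its code point so that the
   --  source file itself stays pure ASCII.
   Left_Arrow : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
       (Wide_Wide_String'(1 => Wide_Wide_Character'Val (16#2190#)));
begin
   --  UTF_8_String is a subtype of String: one element per octet.
   --  U+2190 is encoded as E2 86 90, so the length is 3.
   Ada.Text_IO.Put_Line (Integer'Image (Left_Arrow'Length));
   Ada.Text_IO.Put_Line (Left_Arrow);
end Arrow_Demo;
```

Either way the result is an ordinary String whose elements are UTF-8
octets, not Latin-1 characters.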

> So if it's a
> two-byte UTF-8 code point, then the string will contain two bytes?

A code point's numeric value does not in general fit in 2 bytes: the
range is 0..16#10FFFF#. The length of the UTF-8 representation depends
on the code point value, from 1 byte for ASCII up to 4 bytes. Only
UCS-2 characters (code points up to 16#FFFF#) are limited to at most
3 bytes in UTF-8.
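
To make the dependency on the code point value concrete, here is a small
sketch in plain Ada. The type Code_Point and the function UTF8_Length
are names invented for this example, not part of any library, and the
function does not reject the surrogate range 16#D800#..16#DFFF#:

```ada
with Ada.Text_IO;

procedure UTF8_Length_Demo is
   --  Hypothetical type for the example: the full Unicode range.
   type Code_Point is range 0 .. 16#10FFFF#;

   --  Number of octets UTF-8 needs for a given code point.
   function UTF8_Length (Code : Code_Point) return Positive is
   begin
      if Code <= 16#7F# then
         return 1;   --  ASCII
      elsif Code <= 16#7FF# then
         return 2;
      elsif Code <= 16#FFFF# then
         return 3;   --  rest of UCS-2
      else
         return 4;   --  beyond UCS-2, up to 16#10FFFF#
      end if;
   end UTF8_Length;
begin
   Ada.Text_IO.Put_Line (Positive'Image (UTF8_Length (16#41#)));     --  1
   Ada.Text_IO.Put_Line (Positive'Image (UTF8_Length (16#E9#)));     --  2
   Ada.Text_IO.Put_Line (Positive'Image (UTF8_Length (16#2190#)));   --  3
   Ada.Text_IO.Put_Line (Positive'Image (UTF8_Length (16#1F600#)));  --  4
end UTF8_Length_Demo;
```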

> In that case, as long as I don't need to access single characters
> ever, could I stick with fixed strings?

Yes. But there is no problem accessing single code points either. UTF-8
was designed for easy forward and backward navigation. In the package
above, Get reads code points moving forward and Get_Backwards reads
them moving backwards.
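
Why navigation is easy can be shown without any library: continuation
octets always have the form 2#10xxxxxx#, so you only have to skip them
to find the start of a code point. The following is a rough sketch of
that idea, assuming valid UTF-8 input. Next, Previous and
Is_Continuation are names invented for the example; they are not the
Get and Get_Backwards of the package above:

```ada
with Ada.Text_IO;

procedure UTF8_Navigation_Demo is
   --  "a", U+2190 (octets E2 86 90), "b" as raw UTF-8 octets.
   Text : constant String :=
     'a' & Character'Val (16#E2#) & Character'Val (16#86#)
         & Character'Val (16#90#) & 'b';

   --  Continuation octets are 2#10xxxxxx#, i.e. 16#80# .. 16#BF#.
   function Is_Continuation (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Index of the code point following the one that starts at From;
   --  returns S'Last + 1 when there is none.
   function Next (S : String; From : Positive) return Positive is
      I : Positive := From + 1;
   begin
      while I <= S'Last and then Is_Continuation (S (I)) loop
         I := I + 1;
      end loop;
      return I;
   end Next;

   --  Index of the code point preceding position From.
   --  Requires From > S'First.
   function Previous (S : String; From : Positive) return Positive is
      I : Positive := From - 1;
   begin
      while I > S'First and then Is_Continuation (S (I)) loop
         I := I - 1;
      end loop;
      return I;
   end Previous;
begin
   --  The arrow starts at index 2 and occupies 2 .. 4.
   Ada.Text_IO.Put_Line (Positive'Image (Next (Text, 2)));      --  5
   Ada.Text_IO.Put_Line (Positive'Image (Previous (Text, 5)));  --  2
end UTF8_Navigation_Demo;
```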

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de