From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Thread: 103376,43ab55a75a8b5d1
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news2.google.com!news2.google.com!news4.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!nx01.iad01.newshosting.com!newshosting.com!newsfeed.icl.net!newsfeed.fjserv.net!colt.net!feeder.news-service.com!newsfeed.freenet.de!ecngs!feeder2.ecngs.de!news.osn.de!diablo2.news.osn.de!news.belwue.de!newsfeed.arcor.de!news.arcor.de!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: System.WCh_Cnv
Newsgroups: comp.lang.ada
User-Agent: 40tude_Dialog/2.0.15.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Reply-To: mailbox@dmitry-kazakov.de
Organization: cbb software GmbH
References: <EBEKJMEEPPFAACCBBGNHAELNDIAA.randy@rrsoftware.com>
 <mailman.47.1153823488.30988.comp.lang.ada@ada-france.org>
 <1nbqjel4blzuj$.obwkz78gfdph$.dlg@40tude.net>
 <mailman.48.1153832611.30988.comp.lang.ada@ada-france.org>
Date: Tue, 25 Jul 2006 15:36:42 +0200
Message-ID: <f9r2o22pm0ot$.184kj5ela2gcb.dlg@40tude.net>
NNTP-Posting-Date: 25 Jul 2006 15:36:42 MEST
NNTP-Posting-Host: 2ca09ac3.newsread4.arcor-online.net
X-Trace: 
 DXC=YKPeV]=]o\XghFd\k@b23T:ejgIfPPldTjW\KbG]kaMXea\9g\;7NmUSW3;h_FolCU[6LHn;2LCV^7enW;^6ZC`TIXm65S@:3>_
X-Complaints-To: usenet-abuse@arcor.de
Xref: g2news2.google.com comp.lang.ada:5918
Date: 2006-07-25T15:36:42+02:00
List-Id: <comp.lang.ada>

On Tue, 25 Jul 2006 14:03:21 +0100, Marius Amado-Alves wrote:

>> So I'm quite happy with UTF-8 and plain strings.
> 
> I am more or less happy with this too [1], but I think we can do  
> better. With UTF-8 in strings the two abstractions (codepoints,  
> encodings) are too entangled for my taste. In rigour you cannot use  
> the standard string operations.

Yes, not all of them.

> I mean you can but must fiddle with  
> the encodings i.e. you are not searching for a codepoint but for a  
> particular encoding. Instead I want to be able to write things like
> 
> for I in Str'Range loop
>     if Str (I) = Euro_Sign then ...
> end loop;
>
> I cannot do that with UTF-8 in strings.

I do it this way:

declare
   Index : Integer := Str'First;
   Value : UTF8_Code_Point;
begin
   while Index <= Str'Last loop
      Get (Str, Index, Value);  --  decodes one code point, advances Index
      if Value = Euro_Sign then ...
   end loop;
end;

Actually, if Ada had abstract array interfaces and inheritance, we could
have it in exactly the form you wrote. Alas.
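
For completeness, here is a minimal sketch of what a Get like the one in
the loop above might look like. The declarations of UTF8_Code_Point,
Euro_Sign and Get are my assumptions for illustration, not the actual
library code, and the decoder assumes well-formed UTF-8 (it does no
validation):

```ada
with Ada.Text_IO;

procedure Find_Euro is

   --  Hypothetical declarations; only the names come from the loop above.
   type UTF8_Code_Point is range 0 .. 16#10FFFF#;
   Euro_Sign : constant UTF8_Code_Point := 16#20AC#;

   --  Decode the code point that starts at Str (Index) and advance
   --  Index past it. Assumes well-formed UTF-8; nothing is validated.
   procedure Get
     (Str   : String;
      Index : in out Integer;
      Value : out UTF8_Code_Point)
   is
      Lead   : constant Natural := Character'Pos (Str (Index));
      Length : Positive;  --  number of octets in this sequence
      Accum  : Natural;   --  code point bits collected so far
   begin
      if Lead < 16#80# then
         Length := 1;
         Accum  := Lead;
      elsif Lead < 16#E0# then
         Length := 2;
         Accum  := Lead mod 16#20#;
      elsif Lead < 16#F0# then
         Length := 3;
         Accum  := Lead mod 16#10#;
      else
         Length := 4;
         Accum  := Lead mod 16#08#;
      end if;
      --  Each continuation octet contributes its low six bits.
      for Offset in 1 .. Length - 1 loop
         Accum := Accum * 64 + Character'Pos (Str (Index + Offset)) mod 16#40#;
      end loop;
      Value := UTF8_Code_Point (Accum);
      Index := Index + Length;
   end Get;

   --  EURO SIGN (U+20AC) written out as its three UTF-8 octets.
   Euro_UTF8 : constant String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);

   Str   : constant String := "a" & Euro_UTF8 & "b" & Euro_UTF8;
   Index : Integer := Str'First;
   Value : UTF8_Code_Point;
   Count : Natural := 0;
begin
   while Index <= Str'Last loop
      Get (Str, Index, Value);
      if Value = Euro_Sign then
         Count := Count + 1;
      end if;
   end loop;
   Ada.Text_IO.Put_Line ("Euro signs found:" & Natural'Image (Count));
end Find_Euro;
```

The point is that Index counts octets while Value is a code point, so the
comparison is against the code point and never against a particular
encoding.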

Note that the pattern you refer to goes beyond Unicode issues. Exactly the
same problem exists in pattern matching:

while Index <= Str'Last loop
    if Match (Str, Index, Pattern) then ...
end loop;

Basically it is a stream interface to strings with the ability to roll
back or, equivalently, to look ahead.
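
A rough sketch of that roll-back behaviour, with a literal-text Match
standing in for a real pattern engine. The body of Match and the explicit
step on failure are my assumptions. A function with an "in out" parameter
needs Ada 2012; with an older compiler it would have to be a procedure
with a Boolean out parameter instead:

```ada
with Ada.Text_IO;

procedure Match_Demo is

   --  Hypothetical matcher. On success Index is moved past the matched
   --  text; on failure Index is left alone, so nothing is consumed.
   --  Pattern is assumed to be non-empty.
   function Match
     (Str     : String;
      Index   : in out Integer;
      Pattern : String) return Boolean
   is
      Last : constant Integer := Index + Pattern'Length - 1;
   begin
      if Last <= Str'Last and then Str (Index .. Last) = Pattern then
         Index := Last + 1;  --  consume the match
         return True;
      end if;
      return False;          --  "rolled back": Index unchanged
   end Match;

   Str   : constant String := "abcabxab";
   Index : Integer := Str'First;
   Count : Natural := 0;
begin
   while Index <= Str'Last loop
      if Match (Str, Index, "ab") then
         Count := Count + 1;
      else
         Index := Index + 1;  --  no match here, try the next position
      end if;
   end loop;
   Ada.Text_IO.Put_Line ("Matches:" & Natural'Image (Count));
end Match_Demo;
```

The caller never sees a half-consumed input: either Index has moved past
the whole match or it has not moved at all.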

> Note that Wide_Wide_String is  
> of little help here, because of the endianess issue. But it might be  
> a good idea to base Unico on Wide_Wide_String for closeness to the  
> standard.

I prefer general solutions, like array interfaces. You have an opaque
object. Add an array interface to it, which would return code points or
Wide_x_100_Character or whatever you want. There you are.
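
As an illustration only: Ada 2012 later added user-defined indexing (the
Constant_Indexing aspect), which gets close to this idea for a single
type. The sketch below wraps UTF-8 octets in an opaque tagged type and
indexes it by code point position. The package and all its names are
hypothetical, well-formed UTF-8 is assumed, and each index operation
scans from the start, so it is O(n):

```ada
with Ada.Text_IO;

procedure Indexing_Demo is

   package UTF8_Views is
      type Code_Point is range 0 .. 16#10FFFF#;

      --  Opaque object with an array-like interface (hypothetical).
      type View (Length : Natural) is tagged private
        with Constant_Indexing => Element;

      function To_View (Octets : String) return View;
      function Count (V : View) return Natural;  --  number of code points
      function Element (V : View; Position : Positive) return Code_Point;
   private
      type View (Length : Natural) is tagged record
         Octets : String (1 .. Length);
      end record;
   end UTF8_Views;

   package body UTF8_Views is

      --  Octets in the sequence introduced by Lead (well-formed input).
      function Sequence_Length (Lead : Character) return Positive is
         Code : constant Natural := Character'Pos (Lead);
      begin
         if Code < 16#80# then
            return 1;
         elsif Code < 16#E0# then
            return 2;
         elsif Code < 16#F0# then
            return 3;
         else
            return 4;
         end if;
      end Sequence_Length;

      function To_View (Octets : String) return View is
      begin
         return (Length => Octets'Length, Octets => Octets);
      end To_View;

      function Count (V : View) return Natural is
         Index  : Positive := 1;
         Result : Natural  := 0;
      begin
         while Index <= V.Length loop
            Index  := Index + Sequence_Length (V.Octets (Index));
            Result := Result + 1;
         end loop;
         return Result;
      end Count;

      function Element (V : View; Position : Positive) return Code_Point is
         --  Modulus that strips the length marker bits from a lead octet.
         Lead_Modulus : constant array (1 .. 4) of Positive :=
           (16#80#, 16#20#, 16#10#, 16#08#);
         Index : Positive := 1;
         Size  : Positive;
         Accum : Natural;
      begin
         --  Skip the Position - 1 code points in front of the target.
         for Skipped in 2 .. Position loop
            Index := Index + Sequence_Length (V.Octets (Index));
         end loop;
         Size  := Sequence_Length (V.Octets (Index));
         Accum := Character'Pos (V.Octets (Index)) mod Lead_Modulus (Size);
         for Offset in 1 .. Size - 1 loop
            Accum := Accum * 64 + Character'Pos (V.Octets (Index + Offset)) mod 64;
         end loop;
         return Code_Point (Accum);
      end Element;

   end UTF8_Views;

   use UTF8_Views;

   Euro_UTF8 : constant String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);
   Str       : constant View := To_View ("a" & Euro_UTF8 & "b");
   Euro_Sign : constant Code_Point := 16#20AC#;
begin
   for I in 1 .. Count (Str) loop
      if Str (I) = Euro_Sign then
         Ada.Text_IO.Put_Line ("Euro sign at code point" & Integer'Image (I));
      end if;
   end loop;
end Indexing_Demo;
```

This is tied to one concrete type, so it is not the general abstract
array interface meant above, but the client loop does take the Str (I)
form.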

> [1] What makes me happy about UTF-8 is that it seems to have become a  
> de facto default, common denominator encoding.

Long live Linux! (:-))

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de