From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
X-Received: by 2002:a6b:3c0e:: with SMTP id
 k14-v6mr870481iob.105.1530715404689;
        Wed, 04 Jul 2018 07:43:24 -0700 (PDT)
X-Received: by 2002:aca:75c9:: with SMTP id
 q192-v6mr459187oic.3.1530715404529;
 Wed, 04 Jul 2018 07:43:24 -0700 (PDT)
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!news.uzoreto.com!weretis.net!feeder6.news.weretis.net!feeder.usenetexpress.com!feeder-in1.iad1.usenetexpress.com!border1.nntp.dca1.giganews.com!nntp.giganews.com!u78-v6no1739383itb.0!news-out.google.com!l67-v6ni1673itl.0!nntp.google.com!d7-v6no1744116itj.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.ada
Date: Wed, 4 Jul 2018 07:43:24 -0700 (PDT)
In-Reply-To: <d35454dc-f982-49d7-b727-45a9cc69822b@googlegroups.com>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com;
 posting-host=47.185.195.62;
 posting-account=zwxLlwoAAAChLBU7oraRzNDnqQYkYbpo
NNTP-Posting-Host: 47.185.195.62
References: <70c11a71-3832-4f57-8127-f3f1c48a052f@googlegroups.com>
 <ly1scotsqq.fsf@pushface.org>
 <62e38ee4-f72f-4ed8-bef1-952040fb7f8d@googlegroups.com>
 <lytvpks65b.fsf@pushface.org>
 <64d8b4a1-a92c-4b90-b95c-e821749de969@googlegroups.com>
 <lya7rc9iw0.fsf@pushface.org>
 <887212304.552080112.848502.laguest-archeia.com@nntp.aioe.org>
 <87muvan83x.fsf@adaheads.home> <ly4lhiafs8.fsf@pushface.org>
 <1449870001.552246132.581310.laguest-archeia.com@nntp.aioe.org>
 <lyzhz98lvh.fsf@pushface.org>
 <b0d7482d-3c02-4e0b-8720-58ee5b65af03@googlegroups.com>
 <phg0h7$10dd$1@gioia.aioe.org>
 <c980d621-6d5d-4a23-8005-733bb024285d@googlegroups.com>
 <phg5nk$1a46$1@gioia.aioe.org> <phg6cg$1ba2$1@gioia.aioe.org>
 <bd52280b-662a-49b3-891d-e39044e2bf32@googlegroups.com>
 <phg8lo$1fnq$1@gioia.aioe.org>
 <phht7f$1vj2$1@gioia.aioe.org> <phhuei$1v6$1@gioia.aioe.org>
 <phi5hp$fbv$1@gioia.aioe.org> <phi5t5$g0q$1@gioia.aioe.org>
 <phib5i$pt9$1@gioia.aioe.org> <phii0m$1698$1@gioia.aioe.org>
 <d35454dc-f982-49d7-b727-45a9cc69822b@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <5611f9a5-508b-4846-9d53-4a05599f7f53@googlegroups.com>
Subject: Re: Strange crash on custom iterator
From: "Dan'l Miller" <optikos@verizon.net>
Injection-Date: Wed, 04 Jul 2018 14:43:24 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Xref: reader02.eternal-september.org comp.lang.ada:53591
Date: 2018-07-04T07:43:24-07:00
List-Id: <comp.lang.ada>

On Wednesday, July 4, 2018 at 9:37:40 AM UTC-5, Dan'l Miller wrote:
> On Wednesday, July 4, 2018 at 8:27:53 AM UTC-5, Dmitry A. Kazakov wrote:
> > On 2018-07-04 13:30, J-P. Rosen wrote:
> > > On 04/07/2018 at 12:01, Dmitry A. Kazakov wrote:
> > >> But UTF-8 is actually more efficient in most cases than
> > >> Wide_Wide_String. Random string indexing is practically never used.
> > > !!!! I, and many others, often need to search substrings within a
> > > string; actually, I would have a hard time finding an example of string
> > > manipulation without indexing...
> > >
> > >>> We discussed that point, and the agreement was that making a different
> > >>> type would force the user to many conversions that would bring nothing
> > >>> but trouble, and make Ada once again look impractical out of excessive
> > >>> purism.
> > >>
> > >> Exactly my point. Explicit conversion are necessary because Ada's type
> > >> system is unable to model strings in a type-safe way.
> > > So, you want different types, plus a typing system that would allow to
> > > mix the types and make them compatible.
> >
> > Yes, because they are semantically same: arrays of code points.
> >
> > > .. You might as well put
> > > everything in the same type!
> >
> > No, because they must have different representations.
> >
> > > Anyway, the ARG has to deal with Ada as it is, not as Dmitry dreams it
> > > should be...
> >
> > It requires someone more influential, wise and knowledgeable than me to
> > make and then push such a proposal. I would be satisfied if more people
> > saw the roots of problems with strings etc.
>
> I think that perhaps /all/ readers of this thread see at least one •problem• with UTF-8 (and perhaps with Unicode/ISO10646 in general, regardless of choice of encoding) in Ada's String (and perhaps in Wide_String and Wide_Wide_String too).
>
> The difficulty is that •no one• has the single •solution• for this problem or these concomitant problems.  Not even J-P. Rosen possesses a complete solution in his Wide_Wide_String recommendation, because his replies seem to imply, incorrectly, that there exists a fully-normalized single-codepoint character in Unicode/ISO10646 for each grapheme/letter.  The following article provides 7 examples in 4 languages (2 of which are European languages, no less!) where a single grapheme's most-compact representation in Unicode/ISO10646 is a multi-codepoint sequence.
>
> The most infamous of these 7 examples is the Lithuanian one.  Because, through flukes of sociopolitical history, Vietnamese, French, German, and so forth all had pre-1992 ISO standards or IBM-Microsoft-Apple code-pages for their letters with diacritics, those languages' letters with diacritics got standardized in Unicode/ISO10646 as single codepoints, e.g., ü as U+00FC instead of u U+0075 followed by the combining diaeresis ¨ U+0308.  Poor old Lithuania was under Soviet occupation from 1944 to 1991, during which the Soviets tried to suppress the Lithuanian language.  Due to this suppression, the Soviet character-encoding standards never standardized encodings for Lithuanian letters with all the Lithuanian-specific diacritical marks, such as the 2 example letters given in the article.  Because the timespan was so short between the end of the Soviet occupation of Lithuania in 1991 and the 1992 cut-off for pre-existing character-encoding standards whose characters Unicode/ISO10646 had to encode as single codepoints, poor old Lithuanian characters are 2nd-class citizens in Unicode/ISO10646, whereas all the Western European languages (and those of their former colonies) with diacritical marks are first-class citizens in Unicode/ISO10646.  This has caused a somewhat protracted, slow-motion, multidecade trench war between Lithuania and Unicode/ISO10646 over this issue, made worse every time someone elsewhere on the planet whips up a brand-new character-with-single-codepoint that has never ever existed in the history of humankind and then standardizes this brand-new contrived grapheme-with-single-codepoint in Unicode/ISO10646.
>
> Oh, but Japan and Silicon Valley can devise emojis galore in recent years and not be restricted by strict enforcement of this no-preexisting-character-encoding rule.  Why?  I guess because emojis are cool, but Lithuanian characters are booooorrrrrrrring.

Oh, it would help if I had pressed the paste key:
http://unicode.org/standard/where
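
For anyone who wants to check the single-codepoint claim on their own examples, here is a minimal sketch in Python, chosen only because its standard unicodedata module exposes NFC normalization; nothing in it is Ada-specific. The helper name nfc_code_points and the two sample strings are illustrative assumptions, not taken from the article: u plus combining diaeresis should compose to one code point, while a plus ogonek plus acute (as used in accented Lithuanian text) should stay at two, because no precomposed form of it exists.

```python
import unicodedata


def nfc_code_points(text: str) -> list:
    """Return the code points of `text` after NFC normalization, as U+XXXX labels."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]


# Sample inputs (illustrative assumptions), written with escapes so the
# example does not depend on how this message itself is encoded.
U_DIAERESIS = "u\u0308"            # u + COMBINING DIAERESIS
A_OGONEK_ACUTE = "a\u0328\u0301"   # a + COMBINING OGONEK + COMBINING ACUTE ACCENT

if __name__ == "__main__":
    for label, sample in [("u-diaeresis", U_DIAERESIS),
                          ("a-ogonek-acute", A_OGONEK_ACUTE)]:
        points = nfc_code_points(sample)
        # UTF-8 size of the normalized form, for the storage side of the argument.
        utf8_len = len(unicodedata.normalize("NFC", sample).encode("utf-8"))
        print(f"{label}: {len(points)} code point(s) {points}, "
              f"{utf8_len} UTF-8 byte(s)")
```

The first sample comes back as the single code point U+00FC; the second comes back as U+0105 followed by U+0301. So even a fully normalized Wide_Wide_String would need two elements for that one letter, and indexing by code point is not the same as indexing by grapheme.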