From: "Dan'l Miller" <optikos@verizon.net>
Subject: Re: Strange crash on custom iterator
Date: Wed, 4 Jul 2018 07:43:24 -0700 (PDT)
Message-ID: <5611f9a5-508b-4846-9d53-4a05599f7f53@googlegroups.com>
In-Reply-To: <d35454dc-f982-49d7-b727-45a9cc69822b@googlegroups.com>
On Wednesday, July 4, 2018 at 9:37:40 AM UTC-5, Dan'l Miller wrote:
> On Wednesday, July 4, 2018 at 8:27:53 AM UTC-5, Dmitry A. Kazakov wrote:
> > On 2018-07-04 13:30, J-P. Rosen wrote:
> > > Le 04/07/2018 à 12:01, Dmitry A. Kazakov a écrit :
> > >> But UTF-8 is actually more efficient in most cases than
> > >> Wide_Wide_String. Random string indexing is practically never used.
> > > !!!! I, and many others, often need to search substrings within a
> > > string; actually, I would have a hard time finding an example of string
> > > manipulation without indexing...
> > >
> > >>> We discussed that point, and the agreement was that making a different
> > >>> type would force the user to many conversions that would bring nothing
> > >>> but trouble, and make Ada once again look impractical out of excessive
> > >>> purism.
> > >>
> > >> Exactly my point. Explicit conversion are necessary because Ada's type
> > >> system is unable to model strings in a type-safe way.
> > > So, you want different types, plus a typing system that would allow to
> > > mix the types and make them compatible.
> >
> > Yes, because they are semantically same: arrays of code points.
> >
> > > .. You might as well put
> > > everything in the same type!
> >
> > No, because they must have different representations.
> >
> > > Anyway, the ARG has to deal with Ada as it is, not as Dmitry dreams it
> > > should be...
> >
> > It requires someone more influential, wise and knowledgeable than me to
> > make and then push such a proposal. I would be satisfied if more people
> > saw the roots of problems with strings etc.
>
> I think that perhaps /all/ readers of this see at least one •problem• with UTF-8 in Ada's String (and perhaps with Unicode/ISO10646 in Ada in general, regardless of the choice of encoding, so in Wide_String and Wide_Wide_String too).
>
> The difficulty is that •no one• has the single •solution• for this problem or these concomitant problems. Not even J-P. Rosen possesses a complete solution in his Wide_Wide_String recommendation, because his replies seem to imply, incorrectly, that there exists a fully-normalized single-codepoint character in Unicode/ISO10646 for each grapheme/letter. The following article provides 7 examples in 4 languages (2 of which are European languages, no less!) where a single grapheme's most-compact representation in Unicode/ISO10646 is a multi-codepoint sequence.
>
> The most infamous of these 7 examples is the Lithuanian one. Through flukes of sociopolitical history, Vietnamese, French, German, and so forth all had pre-1992 ISO standards or IBM-Microsoft-Apple code pages for their letters with diacritics, so those letters were standardized in Unicode/ISO10646 as single codepoints, e.g., ü as U+00FC instead of u U+0075 followed by the combining diaeresis U+0308. Poor old Lithuania was under Soviet occupation from 1944 to 1991, during which the Soviets tried to suppress the Lithuanian language. Due to this suppression, the Soviet character-encoding standards never standardized encodings for Lithuanian letters with all the Lithuanian-specific diacritical marks, such as the 2 example letters given in the article linked above. Because the timespan was so short between the end of the Soviet occupation in 1991 and the 1992 cut-off for the pre-existing character-encoding standards whose characters Unicode/ISO10646 had to encode as single codepoints, poor old Lithuanian characters are 2nd-class citizens in Unicode/ISO10646, whereas all the Western European languages (and those of their former colonies) with diacritical marks are first-class citizens. This has caused a protracted, slow-motion, multidecade trench war between Lithuania and Unicode/ISO10646 over this issue, made worse every time someone elsewhere on the planet whips up a brand-new single-codepoint character that has never existed in the history of humankind and then gets this contrived grapheme standardized in Unicode/ISO10646.
>
> Oh, but Japan and Silicon Valley can devise emojis galore in recent years and not be restricted by strict enforcement of this no-preexisting-character-encoding rule. Why? I guess because emojis are cool, but Lithuanian characters are booooorrrrrrrring.
Oh, it would help if I had pressed the paste key:
http://unicode.org/standard/where
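
For anyone who wants to see the asymmetry for themselves, here is a minimal sketch in Python (not Ada, purely because its standard library ships Unicode normalization). The helper name nfc_codepoints is made up for this illustration, and the Lithuanian sample, a with ogonek and acute, is my own pick and not necessarily one of the letters in the article. It normalizes two decomposed graphemes to NFC and counts the codepoints that remain:

```python
import unicodedata


def nfc_codepoints(text: str) -> list[str]:
    """Return the codepoints of text after NFC normalization, as U+XXXX strings.

    Hypothetical helper, named only for this illustration.
    """
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]


# u followed by combining diaeresis: a precomposed codepoint exists.
u_diaeresis = "u\u0308"

# a followed by combining ogonek and combining acute: only the
# a-with-ogonek part has a precomposed codepoint, so the acute stays separate.
a_ogonek_acute = "a\u0328\u0301"

print(nfc_codepoints(u_diaeresis))      # ['U+00FC']
print(nfc_codepoints(a_ogonek_acute))   # ['U+0105', 'U+0301']
```

So even in the most compact normalized form, the second grapheme occupies two elements of a Wide_Wide_String, and indexing by codepoint no longer indexes by letter.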