From: "Dan'l Miller"
Newsgroups: comp.lang.ada
Subject: Re: The extension of Is_Basic to unicode (about AI12-0260-1)
Date: Wed, 11 Apr 2018 07:32:55 -0700 (PDT)
References: <7d5b8717-1e70-4153-af13-dfab24679ed9@googlegroups.com>

On Tuesday, April 10, 2018 at 7:52:34 PM UTC-5, ytomino wrote:
> AI12-0260-1/04 Functions Is_Basic and To_Basic in Wide_Characters.Handling
> http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai12s/ai12-0260-1.txt?rev=1.5&raw=N
>
> ...Has already been formally adopted into RM?
> (status is "Amendment")
>
> I found inconsistency between existing Characters.Handling.Is_Basic and new Wide_Characters.Handling.Is_Basic.
>
> Characters.Handling.Is_Basic in RM:
>
>   True if Item is a basic letter. A basic letter is a character that is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.

If this Ada-specific definition of the is-basic/base-Latin-letter property is the official normative list, then it seems rather arbitrary and capricious, not conforming to Unicode or to linguistic reality.

In Unicode-speak's terminology/jargon, the definition of base character at https://definedterm.com/a/definition/160575 would admit quite a few more, because that definition says that any time an NFD form results in a multi-grapheme sequence, the base character would be the first grapheme, lopping off the combining-graphemes that follow. (For now, let's require that the NFD absolutely must result in a multi-grapheme sequence which absolutely must have at least one combining-grapheme, so that, say, LATIN CAPITAL LETTER WYNN Ƿ U+1F7 and the misnamed (omitting SMALL) LATIN LETTER WYNN ƿ U+1BF and LATIN CAPITAL LETTER YOGH Ȝ U+21C and LATIN SMALL LETTER YOGH ȝ U+21D are all assured to be excluded from the is-basic/base-character list simply because no one in the history of humanity ever attached a diacritical mark to wynn and yogh. Poor wynn and yogh.)

LATIN CAPITAL LETTER AE Æ U+C6 is a base character as per https://definedterm.com/a/definition/160575 because diacritics were attached to it in LATIN CAPITAL LETTER AE WITH ACUTE Ǽ U+1FC and LATIN CAPITAL LETTER AE WITH MACRON Ǣ U+1E2. Likewise for æ as a base character due to {ǽ, ǣ}.
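The Æ example can be checked mechanically. What follows is only a minimal sketch, assuming that the normalization form meant here is Unicode canonical decomposition (NFD) and using Python's standard unicodedata module; the helper names nfd_base and has_combining_marks are made up for illustration and come from neither Ada nor Unicode.

```python
import unicodedata

def nfd_base(ch: str) -> str:
    """Return the first code point of ch's canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", ch)[0]

def has_combining_marks(ch: str) -> bool:
    """True if NFD yields a base followed by at least one combining mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) > 1 and any(
        unicodedata.combining(c) for c in decomposed[1:]
    )

# AE WITH ACUTE, AE WITH MACRON and their lower-case forms, then wynn and yogh.
for ch in "\u01FC\u01E2\u01FD\u01E3\u01F7\u01BF\u021C\u021D":
    base = nfd_base(ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"base U+{ord(base):04X}, combining marks: {has_combining_marks(ch)}"
    )
```

Under this reading, Ǽ and Ǣ both reduce to Æ, while wynn and yogh decompose to themselves with no combining marks.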
But surprisingly, in the Ada-specific variant of defining base-character for the is-basic property, LATIN CAPITAL LETTER EZH Ʒ U+1B7 is for some arbitrary and capricious reason not a base-character despite the standardization of LATIN CAPITAL LETTER EZH WITH CARON Ǯ U+1EE. Likewise for its lower-case analogue ʒ due to ǯ. Likewise surprisingly, in the Ada-specific definition of a base-character for the is-basic property, LATIN SMALL LETTER TURNED O OPEN-O ꭃ U+AB43 is for some arbitrary and capricious reason not a base-character despite the standardization of LATIN SMALL LETTER TURNED O OPEN-O WITH STROKE ꭄ U+AB44. If the Æ ligature is-basic as a base character, then why isn't the directly analogous ꭃ ligature?

I am sure that a Unicode Consortium (or ISO10646) member/contributor/national-representative could find other examples of base characters missing from the Ada-specific definition of is-basic using the definition at https://definedterm.com/a/definition/160575

(I am pretty sure that a Unicode Consortium (or ISO10646) member/contributor/national-representative would say something to the effect of bah humbug to Ada's is-basic property and advise
a) using the first character of NFD instead as the definition of base-character [even when NFD is a single grapheme without trailing combining-graphemes] or
b) having a new function whose behavior would be some variant of true-or-false-would-NFD-be-a-single-grapheme-or-a-multigrapheme-sequence, whose name could be is_diacriticable or is_elligible_for_combining_characters.)
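One possible reading of option (b) can be sketched by scanning the Unicode character database for every Latin letter that appears as the base of some precomposed character. This is only an illustration, assuming the normalization form meant here is canonical decomposition (NFD) and using Python's unicodedata module; nfd_latin_bases, ADA_BASIC and this particular is_diacriticable are hypothetical names, not anything defined by the Ada RM or by AI12-0260-1.

```python
import unicodedata

# Letters that the quoted RM text lists for Characters.Handling.Is_Basic.
ADA_BASIC = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz") | set(
    "\u00C6\u00E6\u00D0\u00F0\u00DE\u00FE\u00DF"
)

def nfd_latin_bases() -> set:
    """Collect Latin letters that start a multi-code-point NFD sequence
    whose remaining code points are all combining marks."""
    bases = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        nfd = unicodedata.normalize("NFD", chr(cp))
        if len(nfd) < 2:
            continue
        base, marks = nfd[0], nfd[1:]
        if all(unicodedata.combining(m) for m in marks) and unicodedata.name(
            base, ""
        ).startswith("LATIN "):
            bases.add(base)
    return bases

BASES = nfd_latin_bases()

def is_diacriticable(ch: str) -> bool:
    """Hypothetical option (b): does some precomposed character use ch as base?"""
    return ch in BASES

# Base letters by this definition that the Ada list leaves out.
missing = sorted(BASES - ADA_BASIC)
print(" ".join(f"{c}(U+{ord(c):04X})" for c in missing))
```

By this definition both ezh letters count as bases that the Ada list omits. The sketch would not report ꭃ, though: as far as I know the Unicode data gives letters WITH STROKE such as ꭄ no canonical decomposition, so catching that case would need some other source, for example character names.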