comp.lang.ada
From: "Dan'l Miller" <optikos@verizon.net>
Subject: Re: The extension of Is_Basic to unicode (about AI12-0260-1)
Date: Wed, 11 Apr 2018 07:32:55 -0700 (PDT)
Message-ID: <b76d295b-385b-41b1-b9ea-b523911b9266@googlegroups.com>
In-Reply-To: <7d5b8717-1e70-4153-af13-dfab24679ed9@googlegroups.com>

On Tuesday, April 10, 2018 at 7:52:34 PM UTC-5, ytomino wrote:
> AI12-0260-1/04 Functions Is_Basic and To_Basic in Wide_Characters.Handling
> http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai12s/ai12-0260-1.txt?rev=1.5&raw=N
> 
> ...Has already been formally adopted into RM? (status is "Amendment")
> 
> I found inconsistency between existing Characters.Handling.Is_Basic and new Wide_Characters.Handling.Is_Basic.
> 
> Characters.Handling.Is_Basic in RM:
> 
>    True if Item is a basic letter. A basic letter is a character that is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.

If this Ada-specific definition of the is-basic (base Latin letter) property is the official normative list, then it seems rather arbitrary and capricious, conforming neither to Unicode nor to linguistic reality.

In Unicode terminology, the definition of base character at https://definedterm.com/a/definition/160575 would admit quite a few more letters.  By that definition, whenever the NFD form (canonical decomposition) of a character is a multi-grapheme sequence, the base character is the first grapheme, with the combining graphemes that follow lopped off.  (For now, let's require that the NFD form absolutely must be a multi-grapheme sequence with at least one combining grapheme, so that, say, LATIN CAPITAL LETTER WYNN Ƿ U+1F7, the misnamed (omitting SMALL) LATIN LETTER WYNN ƿ, LATIN CAPITAL LETTER YOGH Ȝ U+21C, and LATIN SMALL LETTER YOGH ȝ U+21D are all assured to be excluded from the is-basic/base-character list, simply because no one in the history of humanity ever attached a diacritical mark to wynn or yogh.  Poor wynn and yogh.)

LATIN CAPITAL LETTER AE Æ U+C6 is a base character as per https://definedterm.com/a/definition/160575 because diacritics were attached to it in LATIN CAPITAL LETTER AE WITH ACUTE Ǽ U+1FC and LATIN CAPITAL LETTER AE WITH MACRON Ǣ U+1E2.  Likewise for æ as a base character due to {ǽ, ǣ}.
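
To make the decomposition argument concrete, here is a minimal sketch in Python (chosen only because its standard `unicodedata` module exposes normalization; nothing here is Ada or RM text). The helper names `nfd_code_points` and `base_of` are made up for illustration, and "base character" is taken to mean the first code point of the canonical decomposition, as described above.

```python
import unicodedata
from typing import List

def nfd_code_points(ch: str) -> List[str]:
    """Return the canonical decomposition of ch as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

def base_of(ch: str) -> str:
    """First code point of the decomposition (hypothetical 'base character')."""
    return unicodedata.normalize("NFD", ch)[0]

# AE WITH ACUTE and AE WITH MACRON, capital and small
for ch in "\u01FC\u01E2\u01FD\u01E3":
    print(ch, nfd_code_points(ch), "base:", base_of(ch))
```

Each of the four letters decomposes to Æ or æ followed by one combining mark, which is the sense in which Æ and æ act as base characters here.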

But surprisingly, in the Ada-specific variant of defining a base character for the is-basic property, LATIN CAPITAL LETTER EZH Ʒ U+1B7 is for some arbitrary and capricious reason not a base character, despite the standardization of LATIN CAPITAL LETTER EZH WITH CARON Ǯ U+1EE.  Likewise for its lower-case analogue ʒ due to ǯ.  Likewise surprisingly, LATIN SMALL LETTER TURNED O OPEN-O ꭃ U+AB43 is for some arbitrary and capricious reason not a base character, despite the standardization of LATIN SMALL LETTER TURNED O OPEN-O WITH STROKE ꭄ U+AB44.  If the Æ ligature is basic as a base character, then why isn't the directly analogous ꭃ ligature?
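
The mismatch can be sketched as follows, again in Python and only as an illustration. `ada_latin1_is_basic` is a hypothetical transcription of the Characters.Handling wording quoted at the top of this message, not an actual Ada library call, and `nfd_base` is the same first-code-point-of-NFD assumption as before. Note that Ʒ lies outside Latin-1, so Characters.Handling itself could never list it; the comparison only shows what happens if that fixed list were carried over unchanged to wide characters.

```python
import unicodedata

# The extra letters named in the quoted RM text (assumed complete as quoted).
ADA_EXTRA_BASIC = set("\u00C6\u00E6\u00D0\u00F0\u00DE\u00FE\u00DF")  # Æ æ Ð ð Þ þ ß

def ada_latin1_is_basic(ch: str) -> bool:
    """Hypothetical transcription of the quoted Is_Basic definition."""
    return ("A" <= ch <= "Z") or ("a" <= ch <= "z") or ch in ADA_EXTRA_BASIC

def nfd_base(ch: str) -> str:
    """First code point of the canonical decomposition."""
    return unicodedata.normalize("NFD", ch)[0]

# AE WITH ACUTE, EZH WITH CARON (capital and small)
for composed in "\u01FC\u01EE\u01EF":
    base = nfd_base(composed)
    print(composed, "-> base", base, "in quoted list:", ada_latin1_is_basic(base))
```

Under these assumptions Æ passes the quoted list while Ʒ and ʒ do not, even though all three turn up as the first code point of some decomposition.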

I am sure that a Unicode Consortium (or ISO10646) member/contributor/national-representative could find other examples of base characters missing from the Ada-specific definition of is-basic, using the definition at https://definedterm.com/a/definition/160575.

(I am pretty sure that a Unicode Consortium (or ISO10646) member/contributor/national-representative would say something to the effect of bah humbug to Ada's is-basic property and advise
a) using the first character of NFD instead as the definition of base-character [even when NFD is a single grapheme without trailing combining-graphemes]
or
b) having a new function whose behavior would be some variant of true-or-false-would-NFD-be-a-single-grapheme-or-a-multigrapheme-sequence, whose name could be is_diacriticable or is_eligible_for_combining_characters.)
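
The two options might look roughly like this. It is a sketch in Python rather than Ada, and both function names (`first_of_nfd`, `has_combining_tail`) are invented here; neither corresponds to anything in the RM or in AI12-0260-1.

```python
import unicodedata

def first_of_nfd(ch: str) -> str:
    """Option (a): the first code point of the canonical decomposition,
    even when the decomposition is just the character itself."""
    return unicodedata.normalize("NFD", ch)[0]

def has_combining_tail(ch: str) -> bool:
    """Option (b), one possible reading: True when the canonical
    decomposition is longer than one code point and the tail contains
    at least one combining mark (nonzero combining class)."""
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) > 1 and any(unicodedata.combining(c) for c in nfd[1:])

# EZH WITH CARON, AE, WYNN, e WITH ACUTE
for ch in "\u01EE\u00C6\u01F7\u00E9":
    print(ch, "first of NFD:", first_of_nfd(ch),
          "combining tail:", has_combining_tail(ch))
```

Option (a) maps wynn and Æ to themselves, while option (b) answers a different question: whether the character, as given, already carries a combining mark once decomposed.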
