comp.lang.ada
* The extension of Is_Basic to unicode (about AI12-0260-1)
@ 2018-04-11  0:52 ytomino
  2018-04-11  3:38 ` J-P. Rosen
  2018-04-11 14:32 ` Dan'l Miller
  0 siblings, 2 replies; 8+ messages in thread
From: ytomino @ 2018-04-11  0:52 UTC


AI12-0260-1/04 Functions Is_Basic and To_Basic in Wide_Characters.Handling
http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai12s/ai12-0260-1.txt?rev=1.5&raw=N

Has this already been formally adopted into the RM? (Its status is "Amendment".)

I found an inconsistency between the existing Characters.Handling.Is_Basic and the new Wide_Characters.Handling.Is_Basic.

Characters.Handling.Is_Basic in RM:

   True if Item is a basic letter. A basic letter is a character that is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.

Characters.Handling.Is_Basic covers only letters; it does not include other symbols.
Is_Basic ('+') = False.
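
The Character-side behaviour described above can be checked with a small program. This is only a minimal sketch: the procedure name, the constant names and the choice of sample characters are illustrative assumptions, and the expectations in the comments follow from the RM wording quoted above rather than from any particular compiler run.

```ada
with Ada.Text_IO;             use Ada.Text_IO;
with Ada.Characters.Handling; use Ada.Characters.Handling;

procedure Is_Basic_Demo is
   --  Latin-1 code positions are used instead of character literals so
   --  that the source file stays pure ASCII.
   AE_Ligature : constant Character := Character'Val (16#C6#);
   E_Acute     : constant Character := Character'Val (16#C9#);

   --  Illustrative helper: print the result of Is_Basic for one character.
   procedure Show (Label : String; Item : Character) is
   begin
      Put_Line (Label & " => " & Boolean'Image (Is_Basic (Item)));
   end Show;
begin
   Show ("A", 'A');                    --  in 'A'..'Z', so a basic letter
   Show ("AE ligature", AE_Ligature);  --  explicitly listed in the RM text
   Show ("E acute", E_Acute);          --  a letter, but not a basic one
   Show ("+", '+');                    --  not a letter, so not basic
end Is_Basic_Demo;
```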

Wide_Characters.Handling.Is_Basic in AI:

  Returns True if the Wide_Character designated by Item has no Decomposition Mapping in the code charts of ISO/IEC 10646:2017; otherwise returns False. 
  
Wide_Characters.Handling.Is_Basic covers all non-decomposable characters, called "base characters" in the Unicode world. That includes symbols.
Is_Basic ('+') = True.

Perhaps Is_Basic should be defined as the intersection of the set of base characters *and the set of letters* (categorized as 'Ll', 'Lu', 'Lt', 'Lm', 'Lo'... in the Unicode Character Database).

Thanks.



* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11  0:52 The extension of Is_Basic to unicode (about AI12-0260-1) ytomino
@ 2018-04-11  3:38 ` J-P. Rosen
  2018-04-11  3:52   ` ytomino
  2018-04-11 14:32 ` Dan'l Miller
  1 sibling, 1 reply; 8+ messages in thread
From: J-P. Rosen @ 2018-04-11  3:38 UTC


On 11/04/2018 at 02:52, ytomino wrote:
> AI12-0260-1/04 Functions Is_Basic and To_Basic in Wide_Characters.Handling
> I found an inconsistency between the existing Characters.Handling.Is_Basic and the new Wide_Characters.Handling.Is_Basic.
> 
> Characters.Handling.Is_Basic in RM:
> 
>    True if Item is a basic letter. A basic letter is a character that is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.
> 
> Characters.Handling.Is_Basic covers only letters; it does not include other symbols.
> Is_Basic ('+') = False.
> 
> Wide_Characters.Handling.Is_Basic in AI:
> 
>   Returns True if the Wide_Character designated by Item has no Decomposition Mapping in the code charts of ISO/IEC 10646:2017; otherwise returns False. 
>   
> Wide_Characters.Handling.Is_Basic covers all non-decomposable characters, called "base characters" in the Unicode world. That includes symbols.
> Is_Basic ('+') = True.
> 
> Perhaps Is_Basic should be defined as the intersection of the set of base characters *and the set of letters* (categorized as 'Ll', 'Lu', 'Lt', 'Lm', 'Lo'... in the Unicode Character Database).


Right, but the old definition was wrong and the new one is right. In
general, Ada prefers to use existing standards rather than inventing its
own special definitions. If you need to make sure that something is a
letter, there is the Is_Letter function.
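
A caller who wants the narrower, letters-only meaning could combine the two predicates, as in the following minimal sketch. It assumes a compiler and runtime that already provide Ada.Wide_Characters.Handling.Is_Basic as proposed in AI12-0260-1 (it may require an Ada 2022 mode switch); Is_Basic_Letter and Basic_Letter_Demo are hypothetical names and are not part of the RM or the AI.

```ada
with Ada.Text_IO;
with Ada.Wide_Characters.Handling;

procedure Basic_Letter_Demo is
   package WCH renames Ada.Wide_Characters.Handling;

   --  Hypothetical helper: True only for characters that both have no
   --  decomposition mapping (Is_Basic) and are letters (Is_Letter).
   function Is_Basic_Letter (Item : Wide_Character) return Boolean is
     (WCH.Is_Basic (Item) and then WCH.Is_Letter (Item));
begin
   --  'A' is an undecomposable letter; '+' is undecomposable but not a
   --  letter, so the helper is expected to reject it.
   Ada.Text_IO.Put_Line ("A => " & Boolean'Image (Is_Basic_Letter ('A')));
   Ada.Text_IO.Put_Line ("+ => " & Boolean'Image (Is_Basic_Letter ('+')));
end Basic_Letter_Demo;
```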

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11  3:38 ` J-P. Rosen
@ 2018-04-11  3:52   ` ytomino
  0 siblings, 0 replies; 8+ messages in thread
From: ytomino @ 2018-04-11  3:52 UTC


On Wednesday, April 11, 2018 at 12:38:07 PM UTC+9, J-P. Rosen wrote:
> Right, but the old definition was wrong and the new one is right.

I agree with you that the old definition is wrong.
However, should a new function name be used for the new definition?


* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11  0:52 The extension of Is_Basic to unicode (about AI12-0260-1) ytomino
  2018-04-11  3:38 ` J-P. Rosen
@ 2018-04-11 14:32 ` Dan'l Miller
  2018-04-11 20:54   ` J-P. Rosen
  1 sibling, 1 reply; 8+ messages in thread
From: Dan'l Miller @ 2018-04-11 14:32 UTC


On Tuesday, April 10, 2018 at 7:52:34 PM UTC-5, ytomino wrote:
> AI12-0260-1/04 Functions Is_Basic and To_Basic in Wide_Characters.Handling
> http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai12s/ai12-0260-1.txt?rev=1.5&raw=N
> 
> Has this already been formally adopted into the RM? (Its status is "Amendment".)
> 
> I found an inconsistency between the existing Characters.Handling.Is_Basic and the new Wide_Characters.Handling.Is_Basic.
> 
> Characters.Handling.Is_Basic in RM:
> 
>    True if Item is a basic letter. A basic letter is a character that is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.

If this Ada-specific definition of this is-basic/base-Latin-letter property is the official normative list, then it seems rather arbitrary and capricious, not conforming to Unicode or to linguistic reality.

In Unicode terminology, the definition of base character at https://definedterm.com/a/definition/160575 would admit quite a few more, because that definition says that whenever an NDC form results in a multi-grapheme sequence, the base character is the first grapheme, lopping off the combining graphemes that follow.  (For now, let's require that the NDC must result in a multi-grapheme sequence with at least one combining grapheme, so that, say, LATIN CAPITAL LETTER WYNN Ƿ U+1F7, the misnamed (omitting SMALL) LATIN LETTER WYNN ƿ U+1BF, LATIN CAPITAL LETTER YOGH Ȝ U+21C, and LATIN SMALL LETTER YOGH ȝ U+21D are all assured to be excluded from the is-basic/base-character list, simply because no one in the history of humanity ever attached a diacritical mark to wynn or yogh.  Poor wynn and yogh.)

LATIN CAPITAL LETTER AE Æ U+C6 is a base character as per https://definedterm.com/a/definition/160575 because diacritics were attached to it in LATIN CAPITAL LETTER AE WITH ACUTE Ǽ U+1FC and LATIN CAPITAL LETTER AE WITH MACRON Ǣ U+1E2.  Likewise for æ as a base character due to {ǽ, ǣ}.

But surprisingly, in the Ada-specific definition of base character for the is-basic property, LATIN CAPITAL LETTER EZH Ʒ U+1B7 is for some arbitrary and capricious reason not a base character, despite the standardization of LATIN CAPITAL LETTER EZH WITH CARON Ǯ U+1EE.  Likewise for its lower-case analogue ʒ, due to ǯ.  Equally surprisingly, LATIN SMALL LETTER TURNED O OPEN-O ꭃ U+AB43 is for some arbitrary and capricious reason not a base character, despite the standardization of LATIN SMALL LETTER TURNED O OPEN-O WITH STROKE ꭄ U+AB44.  If the Æ ligature is-basic as a base character, then why isn't the directly analogous ꭃ ligature?

I am sure that a Unicode Consortium (or ISO 10646) member/contributor/national representative could find other examples of base characters missing from the Ada-specific definition of is-basic, using the definition at https://definedterm.com/a/definition/160575

(I am pretty sure that a Unicode Consortium (or ISO 10646) member/contributor/national representative would say something to the effect of bah humbug to Ada's is-basic property and advise
a) using the first character of NDC instead as the definition of base-character [even when NDC is a single grapheme without trailing combining-graphemes]
or
b) having a new function whose behavior would be some variant of true-or-false-would-NDC-be-a-single-grapheme-or-a-multigrapheme-sequence, whose name could be is_diacriticable or is_eligible_for_combining_characters.)


* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11 14:32 ` Dan'l Miller
@ 2018-04-11 20:54   ` J-P. Rosen
  2018-04-11 22:20     ` Randy Brukardt
  0 siblings, 1 reply; 8+ messages in thread
From: J-P. Rosen @ 2018-04-11 20:54 UTC


On 11/04/2018 at 16:32, Dan'l Miller wrote:
>> True if Item is a basic letter. A basic letter is a character that
>> is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of
>> the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.
> If this Ada-specific definition of this is-basic/base-Latin-letter
> property is the official normative list, then it seems rather
> arbitrary and capricious, not conforming to Unicode or to linguistic
> reality.
> 
> In Unicode-speak's terminology/jargon, the definition of base
> character at https://definedterm.com/a/definition/160575 would admit
> quite a few more, [...]
The above Is_Basic is about Character, and is defined only when using
Latin-1. Unicode is a different standard.


-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11 20:54   ` J-P. Rosen
@ 2018-04-11 22:20     ` Randy Brukardt
  2018-04-11 23:57       ` ytomino
  0 siblings, 1 reply; 8+ messages in thread
From: Randy Brukardt @ 2018-04-11 22:20 UTC


"J-P. Rosen" <rosen@adalog.fr> wrote in message 
news:palsmv$g18$1@gioia.aioe.org...
> On 11/04/2018 at 16:32, Dan'l Miller wrote:
>>> True if Item is a basic letter. A basic letter is a character that
>>> is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of
>>> the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.
>> If this Ada-specific definition of this is-basic/base-Latin-letter
>> property is the official normative list, then it seems rather
>> arbitrary and capricious, not conforming to Unicode or to linguistic
>> reality.
>>
>> In Unicode-speak's terminology/jargon, the definition of base
>> character at https://definedterm.com/a/definition/160575 would admit
>> quite a few more, [...]
> The above Is_Basic is about Character, and is defined only when using
> Latin-1. Unicode is a different standard.

Moreover, its definition is historical -- it was defined this way for Ada 
95, and whether or not that would be the correct definition had it been 
defined in 2018 is irrelevant. Changing the definition would potentially 
silently break programs that use it. There are a number of things in 
Ada.Characters.Handling that aren't correct for Unicode purposes, one of 
them is even called out by the third note in A.3.2.

                      Randy.





* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11 22:20     ` Randy Brukardt
@ 2018-04-11 23:57       ` ytomino
  2018-04-12  5:14         ` J-P. Rosen
  0 siblings, 1 reply; 8+ messages in thread
From: ytomino @ 2018-04-11 23:57 UTC


On Thursday, April 12, 2018 at 7:20:28 AM UTC+9, Randy Brukardt wrote:
> "J-P. Rosen" <rosen@adalog.fr> wrote in message 
> news:palsmv$g18$1@gioia.aioe.org...
> > On 11/04/2018 at 16:32, Dan'l Miller wrote:
> >>> True if Item is a basic letter. A basic letter is a character that
> >>> is in one of the ranges 'A'..'Z' and 'a'..'z', or that is one of
> >>> the following: 'Æ', 'æ', 'Ð', 'ð', 'Þ', 'þ', or 'ß'.
> >> If this Ada-specific definition of this is-basic/base-Latin-letter
> >> property is the official normative list, then it seems rather
> >> arbitrary and capricious, not conforming to Unicode or to linguistic
> >> reality.
> >>
> >> In Unicode-speak's terminology/jargon, the definition of base
> >> character at https://definedterm.com/a/definition/160575 would admit
> >> quite a few more, [...]
> > The above Is_Basic is about Character, and is defined only when using
> > Latin-1. Unicode is a different standard.
> 
> Moreover, its definition is historical -- it was defined this way for Ada 
> 95, and whether or not that would be the correct definition had it been 
> defined in 2018 is irrelevant. Changing the definition would potentially 
> silently break programs that use it. There are a number of things in 
> Ada.Characters.Handling that aren't correct for Unicode purposes, one of 
> them is even called out by the third note in A.3.2.
> 
>                       Randy.

Thanks for your detailed description.

If Characters.Handling.Is_Basic cannot be changed for compatibility reasons, then this *overloading* is all the more likely to create new problems in the future.

For example, when rewriting an application from Character to Wide_Character, the two meanings of Is_Basic are likely to cause confusion.
They also make it harder to use a use clause, or to pass Is_Basic as a generic formal subprogram.

Excuse me for repeating myself: should a new function name be used for the new definition?

  function Is_Base (Item : Wide_Character) return Boolean; -- according to Unicode
  function Is_Basic (Item : Wide_Character) return Boolean is (Is_Base (Item) and Is_Letter (Item)); -- for compatibility


* Re: The extension of Is_Basic to unicode (about AI12-0260-1)
  2018-04-11 23:57       ` ytomino
@ 2018-04-12  5:14         ` J-P. Rosen
  0 siblings, 0 replies; 8+ messages in thread
From: J-P. Rosen @ 2018-04-12  5:14 UTC


On 12/04/2018 at 01:57, ytomino wrote:
> If Characters.Handling.Is_Basic cannot be changed for compatibility
> reasons, then this *overloading* is all the more likely to create new
> problems in the future.
> 
> For example, when rewriting an application from Character to
> Wide_Character, the two meanings of Is_Basic are likely to cause
> confusion.
> They also make it harder to use a use clause, or to pass Is_Basic as a
> generic formal subprogram.
If you are adapting a program to use the full BMP instead of Latin1,
expect many more difficult issues and/or incompatibilities than this one...

> Excuse me for repeating myself: should a new function name be used for
> the new definition?
> 
>   function Is_Base (Item : Wide_Character) return Boolean; -- according to Unicode
>   function Is_Basic (Item : Wide_Character) return Boolean is (Is_Base (Item) and Is_Letter (Item)); -- for compatibility

This is technically doable, but not obviously desirable. This
incompatibility seems to me to be a candidate for the following comment,
already used for some other incompatibilities:

"This incompatibility is likely to fix more bugs than it will create"

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr
