On Thu, 28 Dec 2006 18:35:06 +0100, Georg Bauhaus wrote:

> GNAT already supports the detection of identifiers that were
> spelled similarly. In case of errors, it lists their "relatives".
> Surely a helpful feature, and a proof that practical handling of
> natural language identifiers is possible.
> As an example, as you have been referring to German, consider that
> sharp s, 'ß', is usually written "SS" when capitalized.
> So "Straße" tends to become "STRASSE". Now if you have a composite
> word that has
> - a 'ß', and
> - an 's' right after it,
> such as "Maßstab" (= scale, rule, yardstick), then from a simple
> minded formalist's perspective I could argue:
>
>   "Using Unicode is nonsense because there is no 1:1 mapping for the
>   German word 'Maßstab' which will become 'MASSSTAB'. "SSS" is
>   ambiguous, it could be "sß" or it could be "ßs". That's too big
>   a challenge for a compiler writer. So leave me alone with your
>   Unicode and case insensitivity."
>
> Is that what computer science has to answer when asked about
> characters handling?

For what it's worth, Ada says that all three of these represent the same
identifier. That's not ideal, but it's the best that we can do without
dropping into the character handling mess ourselves.
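
As a rough illustration of why the three spellings collapse into one, here is
a small Python sketch (Python rather than Ada, purely for convenience). It
relies on the fact that Python's str.upper() applies Unicode full case
mapping, which expands 'ß' to "SS". The helper name same_identifier is made
up for this example, and the comparison is only an approximation of the
identifier rule in the Ada standard, not a restatement of it.

```python
# Illustration only: str.upper() uses Unicode full case mapping, so the
# single character 'ß' becomes the two characters "SS".
spellings = ["Maßstab", "Masßtab", "Massstab"]


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after uppercase conversion."""
    return a.upper() == b.upper()


# Collect the distinct uppercase forms of the three spellings.
upper_forms = {s.upper() for s in spellings}

if __name__ == "__main__":
    print(upper_forms)
    print(same_identifier("Maßstab", "Masßtab"))
```

Because the mapping is many-to-one, nothing in the uppercase form says which
of the original spellings was meant, which is the ambiguity raised in the
quoted text.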

This is even more interesting when you consider that there are alternative
spellings for reserved words. For instance "acceß" is identical to "access".
(See 2.3(5.c/2) in the AARM for more examples). We wrestled with that quite
a while before deciding that such identifiers had to be illegal
(2.3(5.3/2)); we didn't want them appearing in programs in place of reserved
words.
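
The kind of check this implies can be sketched in Python as follows. This is
an assumption-laden illustration, not how any Ada compiler does it:
RESERVED_SAMPLE is a deliberately tiny subset of the reserved words, classify
is a hypothetical name, and Python's str.casefold() (Unicode full case
folding, which turns 'ß' into "ss") stands in for the transformation the
standard describes.

```python
# A deliberately incomplete sample of Ada reserved words, for illustration.
RESERVED_SAMPLE = {"access", "abstract", "begin", "end", "is"}


def classify(name: str) -> str:
    """Hypothetical classifier for a single token.

    A token that matches a reserved word under plain case conversion is the
    reserved word itself. A token that only matches after full case folding
    (for example via 'ß' -> "ss") is an alternative spelling and is rejected.
    """
    if name.lower() in RESERVED_SAMPLE:
        return "reserved word"
    if name.casefold() in RESERVED_SAMPLE:
        return "illegal identifier"
    return "identifier"


if __name__ == "__main__":
    for token in ("ACCESS", "acceß", "Maßstab"):
        print(token, "->", classify(token))
```

The point of the two-step test is that the alternative spelling is neither
accepted as the reserved word nor allowed as an ordinary identifier.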

                                        Randy.