From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: a07f3367d7,8ea33c39efc56ac3
X-Google-Attributes: gida07f3367d7,public,usenet
X-Google-NewGroupId: yes
X-Google-Language: ENGLISH,UTF8
Path: 
 g2news1.google.com!news3.google.com!feeder.news-service.com!news.albasani.net!newsfeed.straub-nv.de!noris.net!newsfeed.arcor.de!newsspool4.arcor-online.net!news.arcor.de.POSTED!not-for-mail
Date: Wed, 12 Oct 2011 20:24:33 +0200
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6;
 rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: sharp =?UTF-8?B?w58gYW5kIHNzIGluIEFkYSBrZXl3b3JkcyBsaWtlIEFD?=
 =?UTF-8?B?Q0VTUw==?=
References: <4e931db5$0$6541$9b4e6d93@newsspool4.arcor-online.net>
 <1f9a5099-f5f5-49a8-8773-b7eaca771427@s5g2000pra.googlegroups.com>
 <4e93381d$0$6545$9b4e6d93@newsspool4.arcor-online.net>
 <op.v2661evjz25lew@macpro-eth1.krischik.com>
 <4e959011$0$6627$9b4e6d93@newsspool2.arcor-online.net>
 <4r1gqrovnlyw$.u64367deu6pt$.dlg@40tude.net>
In-Reply-To: <4r1gqrovnlyw$.u64367deu6pt$.dlg@40tude.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Message-ID: <4e95db62$0$6554$9b4e6d93@newsspool4.arcor-online.net>
Organization: Arcor
NNTP-Posting-Date: 12 Oct 2011 20:24:34 CEST
NNTP-Posting-Host: 16c79ac8.newsspool4.arcor-online.net
X-Trace: 
 DXC=E;C2`813cOPYI9]OHn9o5^4IUK<Cl32<Q4Fo<]lROoRQ8kF<OcfhCO[nJn2oIPGFQRnc\616M64>ZLh>_cHTX3j]F7@<bnc@eZR
X-Complaints-To: usenet-abuse@arcor.de
Xref: g2news1.google.com comp.lang.ada:21395
Date: 2011-10-12T20:24:34+02:00
List-Id: <comp.lang.ada>

On 12.10.11 15:48, Dmitry A. Kazakov wrote:
> On Wed, 12 Oct 2011 15:03:13 +0200, Georg Bauhaus wrote:
> 
>> But I imagine a language rule that addresses common sense
>> more than it does the mechanics of Unicode or the history
>> of writing; it might even be easy to implement:
> 
> Speaking of common sense one should simply drop ß and all other letters not
> present in 7-bit ASCII.

(Why character case? Let's save bits by dropping small letters. ;-)

> If ß=ss, then sch=sh, when matching two
> simple names of different alphabets. How are you going to tag names?
> 
>    German#acceß#  
>    US#access#
> 
> (:-))

The "alphabet" of both "access" and "acceß" (Horrible!) shall
be "Latin", see below.  Thus "access" is not Greek, and
"acceβ" will be an error, because it mixes two "alphabets",
Latin and Greek. The compiler will detect the syntax error.
The same will be true of "AССESS" or "'Rаnge", both being syntax
errors:

$ echo "AССESS" "'Rаnge" |od -c
0000000    A   С  **   С  **   E   S   S       '   R   а  **   n   g   e


Syntax errors are easily detected. The compiler can report
them very clearly:
E: The word "AССESS" uses characters from more than one alphabet
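
A minimal sketch of such a check, in Python rather than in a compiler
front end. It assumes that the first word of a character's Unicode name
("LATIN", "CYRILLIC", "GREEK", ...) is a good enough stand-in for its
"alphabet"; that is a heuristic, not a rule taken from any standard.
The function names alphabets and check are made up for illustration.

```python
import unicodedata

def alphabets(word):
    """Return the set of alphabets used by the letters of `word`.

    Assumption: the first word of the Unicode character name
    identifies the alphabet. Non-letters (such as the apostrophe
    in 'Range) are ignored.
    """
    found = set()
    for ch in word:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def check(word):
    """Return an error message if `word` mixes alphabets, else None."""
    if len(alphabets(word)) > 1:
        return ('E: The word "%s" uses characters from more than '
                'one alphabet' % word)
    return None

if __name__ == "__main__":
    # \u0421 is CYRILLIC CAPITAL LETTER ES, \u0430 is CYRILLIC SMALL
    # LETTER A, \u00df is sharp s, \u03b2 is GREEK SMALL LETTER BETA.
    for w in ["access", "acce\u00df", "acce\u03b2",
              "A\u0421\u0421ESS", "'R\u0430nge"]:
        print(w, check(w) or "ok")
```

With this heuristic, "access" and "acceß" both count as purely Latin,
while the Greek and Cyrillic look-alikes are rejected.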

>> Presuming some practical definition of "alphabet".
> 
> For example?

I'd try a KISS definition of "alphabet". It does not involve
national languages, or meaning.

- Latin characters
- Cyrillic characters
- Greek characters
- Arabic (including Farsi) characters
- Hebrew characters
- Chinese characters (both traditional and simplified)
- Japanese characters; I think the rules might have to be
  a little more picky for Japanese identifiers?
- one of the alphabets used in India where all characters
  must come from a single Unicode group such as Devanagari
  or Gujarati
- Thai, Lao, ... characters
- ...

These groupings operate at some very basic level, they don't
care about the meaning of identifiers.  They ignore national
preferences.  Identifiers may not be in harmony with the
requirements of poetry, then.  But this should be fairly easy
to implement, since it is all about simple sets of characters.
They do not overlap if one draws on ISO 10646. Hence
unions can be formed, and membership tests are easy.
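
The set view can be sketched as follows, again in Python. The code
point ranges are an assumption made for illustration: they cover only
a few ISO 10646 blocks and are not a complete definition of any
alphabet. The names ALPHABETS, alphabet_of and pairwise_disjoint are
hypothetical.

```python
from itertools import combinations

# Illustrative ranges only (assumption): a few blocks per alphabet.
ALPHABETS = {
    "Latin": (frozenset(range(0x0041, 0x005B))      # A..Z
              | frozenset(range(0x0061, 0x007B))    # a..z
              | frozenset(range(0x00C0, 0x0250))    # Latin-1 letters, Extended-A/B
              ) - {0x00D7, 0x00F7},                 # drop the x and / signs
    "Greek": frozenset(range(0x0370, 0x0400)),
    "Cyrillic": frozenset(range(0x0400, 0x0500)),
}

def pairwise_disjoint(groups):
    """True if no code point belongs to two different alphabets."""
    return all(a.isdisjoint(b)
               for a, b in combinations(groups.values(), 2))

def alphabet_of(ch):
    """Return the name of the alphabet containing `ch`, or None."""
    for name, members in ALPHABETS.items():
        if ord(ch) in members:
            return name
    return None

# The union of all alphabets: every character usable in some identifier.
ALL_LETTERS = frozenset().union(*ALPHABETS.values())

if __name__ == "__main__":
    print(pairwise_disjoint(ALPHABETS))
    print(alphabet_of("\u00df"), alphabet_of("\u03b2"), alphabet_of("\u0421"))
```

Because the sets are disjoint, each character belongs to at most one
alphabet, so a lookup like alphabet_of never has to resolve a conflict.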