From: Georg Bauhaus
Date: Wed, 12 Oct 2011 20:24:33 +0200
Newsgroups: comp.lang.ada
Subject: Re: sharp ß and ss in Ada keywords like ACCESS
Message-ID: <4e95db62$0$6554$9b4e6d93@newsspool4.arcor-online.net>

On 12.10.11 15:48, Dmitry A. Kazakov wrote:
> On Wed, 12 Oct 2011 15:03:13 +0200, Georg Bauhaus wrote:
>
>> But I imagine a language rule that addresses common sense
>> more than it does the mechanics of Unicode or the history
>> of writing; it might even be easy to implement:
>
> Speaking of common sense one should simply drop ß and all other letters not
> present in 7-bit ASCII.

(Why character case? Let's save bits by dropping small letters. ;-)

> If ß=ss, then sch=sh, when matching two
> simple names of different alphabets. How are you going to tag names?
>
>    German#acceß#
>    US#access#
>
> (:-))

The "alphabet" of both "access" and "acceß" (Horrible!) shall be
"Latin", see below. Thus "access" is not Greek, and "acceβ" will
be an error, because it mixes two "alphabets", Latin and Greek.
The compiler will detect the syntax error. The same will be true
of "AССESS" or "'Rаnge", both being syntax errors:

$ echo "AССESS" "'Rаnge" |od -c
0000000   A   С  **   С  **   E   S   S       '   R   а  **   n   g   e

Syntax errors are easily detected. The compiler can report
them very clearly:

E: The word "AССESS" uses characters from more than one alphabet

>> Presuming some practical definition of "alphabet".
>
> For example?

I'd try a KISS definition of "alphabet". It does not involve
national languages, or meaning.

- Latin characters
- Cyrillic characters
- Greek characters
- Arabic (including Farsi) characters
- Hebrew characters
- Chinese characters (both old style and reformed style)
- Japanese characters; I think the rules might have to be
  a little more picky for Japanese identifiers?
- one of the alphabets used in India where all characters
  must come from a single Unicode group such as Devanagari
  or Gujarati
- Thai, Lao, ... characters
- ...

These groupings operate at a very basic level; they don't care
about the meaning of identifiers. They ignore national preferences.
Identifiers may not be in harmony with the requirements of poetry,
then. But this should be fairly easy to implement, since it is all
about simple sets of characters. They do not overlap if one draws
on ISO 10646. Hence unions can be formed, and membership tests
are easy.
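To make the membership test concrete, here is a minimal sketch in Python
(not Ada, and not how any existing compiler does it). The names
alphabet_of, alphabets_in and check_word are hypothetical. As a stand-in
for real character sets, the sketch assumes that the first word of a
character's Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) is a good
enough label for its "alphabet"; a real implementation would more likely
draw on the script data published with the standard.

```python
import unicodedata


def alphabet_of(ch):
    """Return a rough "alphabet" label for a letter, or None otherwise.

    Assumption: the first word of the Unicode character name
    approximates the alphabet. Digits, underscores, apostrophes and
    other non-letters are ignored.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def alphabets_in(word):
    """Collect the set of alphabet labels used by the letters of word."""
    return {label for label in map(alphabet_of, word) if label is not None}


def check_word(word):
    """Return an error message if word mixes alphabets, else None."""
    if len(alphabets_in(word)) > 1:
        return ('E: The word "%s" uses characters from more than one alphabet'
                % word)
    return None


if __name__ == "__main__":
    # Escapes make the look-alike characters visible in the source text.
    samples = ["access", "acce\u00df", "acce\u03b2",
               "A\u0421\u0421ESS", "'R\u0430nge"]
    for sample in samples:
        print(sample, "->", check_word(sample) or "ok")
```

With this labelling, "access" and "acceß" both come out as purely Latin,
while "acceβ", "AССESS" and "'Rаnge" are reported as mixing alphabets.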