Newsgroups: comp.lang.ada
From: Shark8
Subject: Re: GNAT vs UTF-8 source file names
Date: Thu, 6 Jul 2017 08:18:41 -0700 (PDT)

On Tuesday, July 4, 2017 at 11:25:20 PM UTC-6, J-P. Rosen wrote:
> On 04/07/2017 at 19:30, Shark8 wrote:
> > This is why I maintain that unicode is crap -- a mistake along the
> > lines of C that will likely take *decades* for the rest of "the
> > industry" / computer science to realize.
> Please don't make such statements until you understand all the issues -
> the problem of character sets is incredibly complicated.

I'm not saying it isn't complicated; I'm saying that it could, and should, have been done better. Instead we get a bizarre Frankenstein's-monster of techniques where some character-glyphs are precomposed (with duplicates across multiple languages) and Zalgo-script is a thing. (see: https://eeemo.net/ )

Not only that, but there's the problem of strings: it could have been solved sensibly ("but wasteful"*) by designing a "multilanguage string" that partitions text by language. Ex:

With Ada.Containers.Indefinite_Vectors;

Package Multi_Lang is
   Type Language is (English, French, Russian); -- Supported languages.

   Type Discriminated_String( Words : Language; Length : Natural ) is record
      Data : String(1..Length); -- Sequence of code-points/characters.
   end record;

   Package Discriminated_String_Vector is new Ada.Containers.Indefinite_Vectors
     ( Index_Type => Positive, Element_Type => Discriminated_String );

   Type Multi_Language_String is new Discriminated_String_Vector.Vector
      with null record; -- New primitive operations go here.
end Multi_Lang;

And *THERE* you have a sane framework for managing multilingual text. Granted, *most* text would only /need/ a single-element vector, because most text is not multilingual; that's ok. The important part here is that the languages are kept distinct and clearly indicated. (This would also allow far more maintainability than unicode's system, because you could then let independent subgroups manage their own language.)
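For illustration, a minimal usage sketch; the Make helper and the sample phrases below are purely illustrative, not part of the proposal (Multi_Lang is the package above):

With Ada.Text_IO, Multi_Lang;
Use  Multi_Lang;

Procedure Demo is
   -- Helper to build one single-language run of text.
   Function Make( Lang : Language; Text : String ) return Discriminated_String is
     ( Words => Lang, Length => Text'Length, Data => Text );

   S : Multi_Language_String;
Begin
   -- One sentence, three runs; each run is tagged with its language.
   S.Append( Make(English, "She had a certain ") );
   S.Append( Make(French,  "je ne sais quoi") );
   S.Append( Make(English, " about her.") );

   For Part of S loop
      Ada.Text_IO.Put_Line( Language'Image(Part.Words) & ": " & Part.Data );
   end loop;
end Demo;

Note how a monolingual string is simply the single-element case, and every element carries its language explicitly rather than inferring it from the code-points.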
>
> >> I have to say that, great as it would be to have this fixed, the
> >> changes required would be extensive, and I can't see that anyone
> >> would think it worth the trouble.
> > One of unicode's biggest problems is that there's no longer any
> > coherent vision -- it started off as an idea to offer one code-point
> > per character in human language, but then shifted to glyph-building
> > (hence combining characters), and as such lacks a unifying
> > principle.
> The unifying principle is the normalization forms. The fact that there
> are several normalization forms comes from the difference between human
> and computer needs.

Perhaps so, but there ought to be a way to identify such a context rather than just throwing these normalized forms in the UTF-string blender, shrugging, and handing it off to the programmers as "not my problem".

As a counter-example, ASN.1 has normalizing encodings like DER and CER, but these are (a) usually distinguished by being defined by their particular encoding, and, when they aren't, (b) proper subsets of BER. [Much like subtypes in Ada: we can use Natural & Positive to better describe our problem, but can fall back to Integer when needed (i.e. foreign interfacing, where the constraint might not be guaranteed); a sketch below makes the analogy concrete.]

* -- Wasteful like keeping the bounds of an array seems wasteful to C programmers.
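The subtype analogy, sketched; the imported get_value function here is hypothetical, standing in for any foreign code whose result carries no guaranteed constraint:

With Interfaces.C, Ada.Text_IO;

Procedure Subtype_Demo is
   -- Natural and Positive are constrained views of Integer, much as
   -- DER and CER are constrained subsets of BER.
   Count : Natural := 0;

   -- Hypothetical foreign function; no constraint on its result is
   -- guaranteed, so the unconstrained base type is used at the boundary.
   Function Get_Value return Interfaces.C.int
     with Import, Convention => C, External_Name => "get_value";

   Raw : constant Integer := Integer( Get_Value );
Begin
   If Raw >= 0 then
      Count := Raw; -- Constraint checked here, on entry into the subtype.
   end if;
   Ada.Text_IO.Put_Line( "Count =" & Natural'Image(Count) );
end Subtype_Demo;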