comp.lang.ada
From: Shark8 <onewingedshark@gmail.com>
Subject: Re: GNAT vs UTF-8 source file names
Date: Thu, 6 Jul 2017 08:18:41 -0700 (PDT)
Message-ID: <c65d0a6b-8dbb-4222-936f-838438e8d5bd@googlegroups.com>
In-Reply-To: <ojht0q$srt$1@dont-email.me>

On Tuesday, July 4, 2017 at 11:25:20 PM UTC-6, J-P. Rosen wrote:
> Le 04/07/2017 à 19:30, Shark8 a écrit :
> > This is why I maintain that unicode is crap -- a mistake along the
> > lines of C that will likely take *decades* for the rest of "the
> > industry" / computer science to realize.
> Please don't make such statements until you understand all the issues -
> the problem of character sets is incredibly complicated.

I'm not saying it isn't complicated; I'm saying that it could, and should, have been done better. Instead we get a bizarre Frankenstein's-monster of techniques where some character-glyphs are precomposed (with duplicates across multiple languages) and Zalgo-script is a thing. (see: https://eeemo.net/ )

Not only that, but there's the problem of strings: the sensible ("but wasteful"*) thing would have been to design a "multilanguage string" that partitions strings by language. Ex:

Type Language is (English, French, Russian); -- supported languages

Type Discriminated_String( Words : Language; Length : Natural ) is record
  Data : String(1..Length); -- Sequence of code-points/characters.
end record;

Package Discriminated_String_Vector is new Ada.Containers.Indefinite_Vectors
  ( Index_Type => Positive, Element_Type => Discriminated_String );


Type Multi_Language_String is new Discriminated_String_Vector.Vector with null record;
-- New primitive operations.

And *THERE* you have a sane framework for managing multilingual text. Granted, *most* text would only /need/ a single-element vector, because most text is not multilingual; that's ok. The important part here is that the languages are kept distinct and clearly indicated. (This would also be far more maintainable than Unicode's system, because independent subgroups could then manage their own language.)
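As a rough, compilable sketch of how the declarations above might be used (the procedure name Multi_Language_Demo, the helper Segment and the sample text are hypothetical, made up purely for illustration, and the "new primitive operations" are left out):

```ada
with Ada.Containers.Indefinite_Vectors;
with Ada.Text_IO;

procedure Multi_Language_Demo is

   type Language is (English, French, Russian);  --  supported languages

   type Discriminated_String (Words : Language; Length : Natural) is record
      Data : String (1 .. Length);  --  sequence of code-points/characters
   end record;

   package Discriminated_String_Vector is new Ada.Containers.Indefinite_Vectors
     (Index_Type => Positive, Element_Type => Discriminated_String);

   type Multi_Language_String is
     new Discriminated_String_Vector.Vector with null record;

   --  Hypothetical helper (not part of the original proposal):
   --  tags a piece of text with its language.
   function Segment
     (Words : Language; Text : String) return Discriminated_String is
   begin
      return (Words => Words, Length => Text'Length, Data => Text);
   end Segment;

   Greeting : Multi_Language_String;

begin
   --  Each element carries exactly one language.
   Greeting.Append (Segment (English, "Hello, "));
   Greeting.Append (Segment (French, "monde"));

   --  Walk the segments; the language of each one is always known.
   for I in Greeting.First_Index .. Greeting.Last_Index loop
      declare
         Part : constant Discriminated_String := Greeting.Element (I);
      begin
         Ada.Text_IO.Put_Line
           (Language'Image (Part.Words) & ": " & Part.Data);
      end;
   end loop;
end Multi_Language_Demo;
```

Since Data is a plain String, this sketch can only hold Latin-1 text; a real version would presumably pick the character type per language (e.g. Wide_String for Russian), which is an assumption on my part and not something spelled out above.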

> 
> >> I have to say that, great as it would be to have this fixed, the
> >> changes required would be extensive, and I can’t see that anyone
> >> would think it worth the trouble.
> > One of unicode's biggest problems is that there's no longer any
> > coherent vision -- it started off as an idea to offer one code-point
> > per character in human language, but then shifted to glyph-building
> > (hence combining characters), and as such lacks a unifying
> > principle.
> The unifying principle is the normalization forms. The fact that there
> are several normalization forms comes from the difference between human
> and computer needs.

Perhaps so, but there ought to be a way to identify such a context rather than just throwing these normalized forms in the UTF-string blender, shrugging, and handing it off to the programmers as "not my problem".

As a counter-example, ASN.1 has normalizing encodings like DER and CER, but these are (a) usually distinguished by being defined by their particular encoding, and (b) when they aren't, are proper subsets of BER. [Much like subtypes in Ada, where we can use Natural & Positive to better describe our problem but can fall back to Integer when needed (i.e. foreign interfacing, where the constraint might not be guaranteed).]
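To make the subtype analogy concrete, here is a small sketch. Raw_Value is a hypothetical stand-in for a value arriving over a foreign interface; the names and the value -1 are invented for illustration:

```ada
with Ada.Text_IO;

procedure Subtype_Demo is

   --  Hypothetical stand-in for a foreign routine: nothing guarantees
   --  that the value it hands back is non-negative.
   function Raw_Value return Integer is
   begin
      return -1;
   end Raw_Value;

   Raw : constant Integer := Raw_Value;  --  the unconstrained "BER" view

begin
   if Raw in Natural then
      --  Membership test passed, so the constrained view is safe.
      declare
         Count : constant Natural := Raw;
      begin
         Ada.Text_IO.Put_Line ("Count =" & Natural'Image (Count));
      end;
   else
      --  Value lies outside the subtype; handle it explicitly.
      Ada.Text_IO.Put_Line ("Rejected:" & Integer'Image (Raw));
   end if;
end Subtype_Demo;
```

The point is only that the constrained form (Natural) is a proper subset of the general one (Integer), and the boundary between them is explicit in the program text.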


* -- Wasteful like keeping the bounds of an array seems wasteful to C programmers.
