comp.lang.ada
From: Simon Wright <simon@pushface.org>
Subject: Re: GNAT vs UTF-8 source file names
Date: Fri, 07 Jul 2017 12:01:19 +0100
Message-ID: <lybmow1pfk.fsf@pushface.org>
In-Reply-To: ojngbr$rcp$1@dont-email.me

"J-P. Rosen" <rosen@adalog.fr> writes:

> On 06/07/2017 at 20:43, Simon Wright wrote:
>>> Even if you use Latin-1, the set of allowed characters is defined as
>>> those that belong to NFKC.
>> I don't understand.
>> 
>> If your source has no BOM and you don't say -gnatW8, GNAT expects
>> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
>> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>> 
>> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
>> use in unit names - ARM 2.1(16) says it should be accepted.
>> 
>> (later) UTF8 is accepted in strings but not in identifiers.
>
> This is a common confusion between characters, coded sets, and encodings...
>
> ISO-10646 defines a coded set (code points) for a number of characters
> (identical to the one defined by Unicode). Some of these characters can
> be represented in NFKC. These are the allowed characters.
>
> If you use Latin-1, you have different code points for the same
> characters - and the allowed characters are still those representable in
> NFKC, even with different code points.
>
> UTF8 is an encoding, nothing more than a compression algorithm for
> numerical values. It is generally used to compress Unicode strings, but
> could be used for any numerical values. In any case, it doesn't change
> logical values, just the way they are stored.

I think this is a response to my "I don't understand" - I think I do
understand a little better now, thank you.
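
To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration; nothing here comes from the ARM or from GNAT). It takes one character, shows its code point, shows how the stored bytes differ between the Latin-1 and UTF-8 encodings, and checks that the decomposed (NFD) spelling normalizes back to the same single character under NFKC. The variable names are my own.

```python
import unicodedata

# One character: LATIN SMALL LETTER E WITH ACUTE.
ch = "\u00e9"

# The logical value (code point) is independent of how it is stored.
code_point = ord(ch)

# Two encodings of the same character give different byte sequences.
latin1_bytes = ch.encode("latin-1")   # one byte
utf8_bytes = ch.encode("utf-8")       # two bytes

# NFD splits the character into 'e' plus a combining acute accent;
# NFKC recomposes that pair into the single precomposed character.
nfd = unicodedata.normalize("NFD", ch)
nfkc_of_nfd = unicodedata.normalize("NFKC", nfd)

print(hex(code_point), latin1_bytes.hex(), utf8_bytes.hex())
print(len(nfd), nfkc_of_nfd == ch)
```

The stored bytes change with the encoding, and the number of code points changes with the normalization form, but the character being named stays the same.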

The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says

   "An Ada implementation shall accept Ada source code in UTF-8
   encoding, with or without a BOM (see A.4.11), where every character
   is represented by its code point."

a requirement that GNAT does not meet unless the source has a BOM or
-gnatW8 is used.
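
The behaviour I am describing can be modelled roughly as below. This is only a sketch of the rule as observed in this thread, written in Python for illustration; it is not GNAT's actual implementation, and the function name `expected_encoding` and the `gnat_w8` parameter are hypothetical stand-ins for "was -gnatW8 given".

```python
import codecs


def expected_encoding(source: bytes, gnat_w8: bool = False) -> str:
    """Return the encoding GNAT appears to assume for a source file.

    Assumption (from the observations in this thread, not from the
    GNAT sources): UTF-8 is assumed only if the file starts with a
    UTF-8 BOM or -gnatW8 was given; otherwise Latin-1 is assumed.
    """
    if gnat_w8 or source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return "latin-1"


# A UTF-8 encoded identifier with no BOM and no -gnatW8 ...
raw = "Café : Integer;".encode("utf-8")
print(expected_encoding(raw))                     # Latin-1 is assumed
print(expected_encoding(codecs.BOM_UTF8 + raw))   # BOM present
print(expected_encoding(raw, gnat_w8=True))       # switch given
```

Under this model a BOM-less UTF-8 file is read as Latin-1, so each multi-byte sequence is seen as several separate Latin-1 characters, which is why 2.1(16/3) is not satisfied by default.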

On the other hand, ARM 2.1(4/3) says "The coded representation for
characters is implementation defined", which seems to conflict with (16)
- but then, the AARM ramification (4.b/2) notes that the rule doesn't
have much force!

