From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "J-P. Rosen" <rosen@adalog.fr>
Newsgroups: comp.lang.ada
Subject: Re: GNAT vs UTF-8 source file names
Date: Fri, 7 Jul 2017 10:26:08 +0200
Organization: A noiseless patient Spider
Message-ID: <ojngbr$rcp$1@dont-email.me>
References: <lytw55kei5.fsf@pushface.org> <lyefuia5ur.fsf@pushface.org>
 <lyeftw2tlc.fsf@pushface.org> <ojhspu$sb2$1@dont-email.me>
 <ly60f72p1g.fsf@pushface.org> <ojihrl$qu2$1@dont-email.me>
 <lyfue91k4a.fsf@pushface.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 7 Jul 2017 08:22:19 -0000 (UTC)
Injection-Info: mx02.eternal-september.org;
 posting-host="d52e56542d0c212d612845daa3d7c429";
	logging-data="28057"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX19kOQnqXrmt2gOXgl/AesLd"
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
In-Reply-To: <lyfue91k4a.fsf@pushface.org>
Content-Language: fr
Cancel-Lock: sha1:4ocLsQIkb9Fr3WVOWPBkuJK6LXs=
Xref: news.eternal-september.org comp.lang.ada:47309
Date: 2017-07-07T10:26:08+02:00
List-Id: <comp.lang.ada>

On 06/07/2017 at 20:43, Simon Wright wrote:
>> Even if you use Latin-1, the set of allowed characters is defined as
>> those that belong to NFKC.
> I don't understand.
> 
> If your source has no BOM and you don't say -gnatW8, GNAT expects
> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
> expects UTF8 encoding (I haven't tried what happens if you use NFD).
> 
> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
> use in unit names - ARM 2.1(16) says it should be accepted.
> 
> (later) UTF8 is accepted in strings but not in identifiers.

This is a common confusion between characters, coded sets, and encodings...

ISO-10646 defines a coded set (code points) for a number of characters
(identical to the one defined by Unicode). Some of these characters can
be represented in NFKC. These are the allowed characters.

If you use Latin-1, you have a different coded set for the same
characters (even though its values happen to coincide with the first 256
code points of ISO-10646) - and the allowed characters are still those
representable in NFKC, whatever their code points.
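
As a rough illustration of the point above (a Python sketch, nothing to
do with how GNAT implements it): the NFKC question is asked about the
character, not about the bytes it arrived in. The helper name
is_nfkc_stable is hypothetical, and "unchanged by NFKC normalization" is
only one possible reading of "representable in NFKC".

```python
import unicodedata


def is_nfkc_stable(text: str) -> bool:
    """Hypothetical helper: True if NFKC normalization leaves text unchanged."""
    return unicodedata.normalize("NFKC", text) == text


# The same character, reached through two different byte representations.
from_latin1 = b"\xe9".decode("latin-1")   # single byte 0xE9 read as Latin-1
from_utf8 = b"\xc3\xa9".decode("utf-8")   # bytes 0xC3 0xA9 read as UTF-8

# Both decode to the same logical character (e with acute accent).
same_character = from_latin1 == from_utf8

# The check is applied to the character, whatever the source encoding was.
e_acute_ok = is_nfkc_stable(from_latin1)

# U+FB01 (the "fi" ligature) is rewritten by NFKC into the letters "f", "i",
# so it is not stable under NFKC.
ligature_ok = is_nfkc_stable("\ufb01")
```

Here same_character and e_acute_ok are True, while ligature_ok is False.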

UTF-8 is an encoding, nothing more than a compact variable-length
representation of numerical values. It is generally used to store
Unicode strings, but the same scheme could be applied to any numerical
values. In any case, it doesn't change logical values, just the way they
are stored.
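
To make this concrete, here is a minimal Python sketch that applies the
UTF-8 bit layout directly to integers. The function utf8_bytes is a
hypothetical name introduced for illustration; it only covers values
that fit in four bytes and does not reject surrogates, precisely because
it treats its input as a plain number rather than as a character.

```python
def utf8_bytes(value: int) -> bytes:
    """Hypothetical helper: store an integer using the UTF-8 bit layout."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:                      # 7 bits  -> 1 byte
        return bytes([value])
    if value < 0x800:                     # 11 bits -> 2 bytes
        return bytes([0xC0 | (value >> 6),
                      0x80 | (value & 0x3F)])
    if value < 0x10000:                   # 16 bits -> 3 bytes
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    if value < 0x200000:                  # 21 bits -> 4 bytes
        return bytes([0xF0 | (value >> 18),
                      0x80 | ((value >> 12) & 0x3F),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    raise ValueError("value too large for this 4-byte sketch")


code_point = 0xE9                          # e with acute accent
as_latin1 = chr(code_point).encode("latin-1")
as_utf8 = chr(code_point).encode("utf-8")

# Different storage, same logical value after decoding.
back_from_latin1 = ord(as_latin1.decode("latin-1"))
back_from_utf8 = ord(as_utf8.decode("utf-8"))
```

The stored bytes differ (one byte in Latin-1, two in UTF-8), but both
decode back to the same number, and utf8_bytes(0xE9) yields the same two
bytes as the library encoder.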


-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr