From: Simon Wright
Newsgroups: comp.lang.ada
Subject: Re: GNAT vs UTF-8 source file names
Date: Fri, 07 Jul 2017 12:01:19 +0100

"J-P. Rosen" writes:

> On 06/07/2017 at 20:43, Simon Wright wrote:
>>> Even if you use Latin-1, the set of allowed characters is defined as
>>> those that belong to NFKC.
>> I don't understand.
>>
>> If your source has no BOM and you don't say -gnatW8, GNAT expects
>> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
>> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>>
>> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
>> use in unit names - ARM 2.1(16) says it should be accepted.
>>
>> (later) UTF8 is accepted in strings but not in identifiers.
>
> This is a common confusion between characters, coded sets, and encodings...
>
> ISO-10646 defines a coded set (code points) for a number of characters
> (identical to the one defined by Unicode). Some of these characters can
> be represented in NFKC. These are the allowed characters.
>
> If you use Latin-1, you have different code points for the same
> characters - and the allowed characters are still those representable in
> NFKC, even with different code points.
>
> UTF8 is an encoding, nothing more than a compression algorithm for
> numerical values. It is generally used to compress Unicode strings, but
> could be used for any numerical values. In any case, it doesn't change
> logical values, just the way they are stored.

I think this is a response to my "I don't understand" - I think I do
understand a little better now, thank you.

The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says "An
Ada implementation shall accept Ada source code in UTF-8 encoding, with
or without a BOM (see A.4.11), where every character is represented by
its code point." which for GNAT is not met unless either there is a BOM
or -gnatW8 is used.

On the other hand, ARM 2.1(4/3) says "The coded representation for
characters is implementation defined", which seems to conflict with (16)
- but then, the AARM ramification (4.b/2) notes that the rule doesn't
have much force!
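
A small sketch may help make the distinction between a character, its code point, and its stored encoding concrete. It is written in Python purely for illustration. The helper expected_encoding is hypothetical: it only restates the behaviour described above (a BOM or -gnatW8 means UTF-8, otherwise Latin-1) and is not how GNAT is implemented.

```python
import codecs
import unicodedata

ch = "é"  # LATIN SMALL LETTER E WITH ACUTE

# The code point is the number that identifies the character in Unicode.
code_point = ord(ch)

# Two encodings store the same character as different byte sequences.
latin1_bytes = ch.encode("latin-1")  # one byte
utf8_bytes = ch.encode("utf-8")      # two bytes

# Decoding each with its own encoding gives back the same character,
# so the encoding changes the storage and not the logical value.
same_character = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8")

# Normalization concerns sequences of characters, independent of encoding:
# NFD splits the letter into "e" plus a combining accent, NFKC recomposes it.
decomposed = unicodedata.normalize("NFD", ch)
recomposed = unicodedata.normalize("NFKC", decomposed)


def expected_encoding(source: bytes, gnat_w8: bool = False) -> str:
    """Hypothetical helper: the encoding a compiler following the rule
    described in this post would assume for a source file's bytes."""
    if gnat_w8 or source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return "latin-1"
```

Under that rule, a UTF-8 file with neither a BOM nor the switch would be read as Latin-1, which is the case that ARM 2.1(16/3) says should be accepted as UTF-8.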