From: "J-P. Rosen"
Newsgroups: comp.lang.ada
Subject: Re: GNAT vs UTF-8 source file names
Date: Fri, 7 Jul 2017 10:26:08 +0200

On 06/07/2017 at 20:43, Simon Wright wrote:
>> Even if you use Latin-1, the set of allowed characters is defined as
>> those that belong to NFKC.
> I don't understand.
>
> If your source has no BOM and you don't say -gnatW8, GNAT expects
> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>
> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
> use in unit names - ARM 2.1(16) says it should be accepted.
>
> (later) UTF8 is accepted in strings but not in identifiers.

This is a common confusion between characters, coded sets, and
encodings...

ISO-10646 defines a coded set (code points) for a number of characters
(identical to the one defined by Unicode).
Some of these characters can be represented in NFKC. These are the
allowed characters.

If you use Latin-1, you have different code points for the same
characters - and the allowed characters are still those representable in
NFKC, even with different code points.

UTF8 is an encoding, nothing more than a compression algorithm for
numerical values. It is generally used to compress Unicode strings, but
could be used for any numerical values. In any case, it doesn't change
logical values, just the way they are stored.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr
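
P.S. The distinction above between a character's logical value and the way
it is stored can be illustrated with a minimal sketch. It is written in
Python rather than Ada purely for brevity, it says nothing about how GNAT
itself reads source files, and the sample character and variable names are
assumptions chosen for illustration only.

```python
import unicodedata

# Assumption: U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is just a sample character.
ch = "\u00e9"

# The logical value (code point) does not depend on how the character is stored.
code_point = ord(ch)

# Two different encodings give two different byte sequences for that one value.
latin1_bytes = ch.encode("latin-1")   # one byte
utf8_bytes = ch.encode("utf-8")       # two bytes

# Decoding each byte sequence with its own codec yields the same character.
same_char = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == ch

# Normalization operates on characters, not on stored bytes: the decomposed
# form (NFD: 'e' followed by a combining acute accent) composes back to the
# single character under NFKC.
decomposed = unicodedata.normalize("NFD", ch)
nfkc = unicodedata.normalize("NFKC", decomposed)

# Reading UTF-8 bytes with the wrong codec changes the logical values:
# two characters come out instead of one.
misread = utf8_bytes.decode("latin-1")
```

Only the last step, where bytes are decoded with a codec that does not match
the one used to write them, alters the characters; the encodings themselves
leave the logical values untouched.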