From: Simon Wright
Newsgroups: comp.lang.ada
Subject: Re: GNAT vs UTF-8 source file names
Date: Wed, 05 Jul 2017 10:47:39 +0100

"J-P. Rosen" writes:

> On 04/07/2017 at 15:57, Simon Wright wrote:
>> The reason for this apparently-bizarre message is[3] that macOS takes
>> the composed form (lowercase a acute) and converts it under the hood
>> to what HFS+ insists on, the fully decomposed form (lowercase a,
>> combining acute); thus the names are actually different even though
>> they _look_ the same.
> Apparently, they use NFD (Normalization Form D). Normalization forms
> are necessary to avoid a whole lot of problems, although Ada requires
> normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
> implementation defined if the text is not in NFC.

That reference specifies NFKC, which I suppose is near! GNAT uses this
if either you compile with -gnatW8 or the file begins with a UTF-8 BOM.
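To make the composed/decomposed distinction concrete, here is a minimal sketch in Python, chosen only because its standard library ships `unicodedata`; it has nothing to do with GNAT itself. The file name is the one used in this thread, and the variable names are purely illustrative.

```python
import unicodedata

# Composed spelling: U+00E1 LATIN SMALL LETTER A WITH ACUTE.
composed = "p\u00e1ck3.ads"
# Decomposed spelling: 'a' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "pa\u0301ck3.ads"

# Normalize each spelling into the other form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
nfkc = unicodedata.normalize("NFKC", decomposed)

# The two spellings render identically but do not compare equal.
print(composed == decomposed)
# After normalization they can be compared reliably.
print(nfc == composed, nfd == decomposed)
# For this particular name NFKC and NFC give the same result.
print(nfkc == nfc)
# The UTF-8 encodings differ in both content and length.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
```

A tool that compares an expected file name with one read back from the file system would need to normalize both sides to the same form before comparing.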

The problems I've noted in this thread in the GNAT implementation are
two:

(1) On Windows and macOS (and possibly on VMS, not sure if that's
relevant any more) the file name corresponding to a unit name is
converted to lower-case assuming it's Latin-1 -
System.Case_Util.To_Lower,

   function To_Lower (A : Character) return Character is
      A_Val : constant Natural := Character'Pos (A);
   begin
      if A in 'A' .. 'Z'
        or else A_Val in 16#C0# .. 16#D6#
        or else A_Val in 16#D8# .. 16#DE#
      then
         return Character'Val (A_Val + 16#20#);
      else
         return A;
      end if;
   end To_Lower;

This is the problem that prevents use of extended characters in unit
names.

(2) On macOS, the expected file name appears to be stored in NFC, but
is retrieved from the file system in NFD. It seems this will only cause
a problem if you compile the file (on its own, not as part of the
closure of another file - weird - possibly because the wildcard picks
up the NFD representation, while compiling as part of the closure uses
the NFC representation in the ALI?) with -gnatwe:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f p*.ads -gnatwe
   gcc -c -gnatwe páck3.ads
   páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
   gnatmake: "páck3.ads" compilation error

(this message was copied from Terminal and pasted into Emacs, which
makes clear the difference between the two representations; previously
I've copied from Terminal and pasted into Safari/Bugzilla, which
produced identical glyphs).
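
Problem (1) can be illustrated with a small Python sketch. `to_lower_latin1` is a transcription of the `To_Lower` body quoted above. `fold_file_name` and `is_valid_utf8` are hypothetical helpers, and applying the fold to each byte of a UTF-8 encoded name is an assumption based on the description above, not a trace of what GNAT actually does.

```python
def to_lower_latin1(byte: int) -> int:
    """Transcription of the quoted To_Lower, operating on one byte value."""
    if (0x41 <= byte <= 0x5A          # 'A' .. 'Z'
            or 0xC0 <= byte <= 0xD6   # Latin-1 upper-case letters
            or 0xD8 <= byte <= 0xDE):
        return byte + 0x20
    return byte


def fold_file_name(name: bytes) -> bytes:
    """Hypothetical helper: apply the Latin-1 fold to every byte."""
    return bytes(to_lower_latin1(b) for b in name)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# With a Latin-1 encoded name the fold does what was intended.
latin1_name = "P\u00c1CK3.ADS".encode("latin-1")
print(fold_file_name(latin1_name).decode("latin-1"))

# With a UTF-8 encoded name that is already lower case, the lead byte
# 16#C3# of the two-byte sequence for a-acute lies in 16#C0# .. 16#D6#,
# so it is shifted to 16#E3# and the sequence is corrupted.
utf8_name = "p\u00e1ck3.ads".encode("utf-8")
folded_utf8 = fold_file_name(utf8_name)
print(utf8_name, folded_utf8, is_valid_utf8(folded_utf8))
```

Under that assumption the damage happens even when the name contains no upper-case letters, which would explain why extended characters in unit names fail regardless of case.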