From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Simon Wright <simon@pushface.org>
Newsgroups: comp.lang.ada
Subject: Re: GNAT vs UTF-8 source file names
Date: Wed, 05 Jul 2017 10:47:39 +0100
Organization: A noiseless patient Spider
Message-ID: <ly60f72p1g.fsf@pushface.org>
References: <lytw55kei5.fsf@pushface.org> <lyefuia5ur.fsf@pushface.org>
	<lyeftw2tlc.fsf@pushface.org> <ojhspu$sb2$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: mx02.eternal-september.org;
 posting-host="1fd775f133d24a665d85179a3c87631c";
	logging-data="31657"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX196eoN/MmeAucTAeTXsE77JDAd0Fsp5DwQ="
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (darwin)
Cancel-Lock: sha1:4QvMALPFd1yT7zhK58w4xB/w/U4=
	sha1:GjxY5jv66zFI8/wN55+4PAayR2o=
Xref: news.eternal-september.org comp.lang.ada:47297
List-Id: <comp.lang.ada>

"J-P. Rosen" <rosen@adalog.fr> writes:

> On 04/07/2017 at 15:57, Simon Wright wrote:
>> The reason for this apparently-bizarre message is[3] that macOS takes
>> the composed form (lowercase a acute) and converts it under the hood
>> to what HFS+ insists on, the fully decomposed form (lowercase a,
>> combining acute); thus the names are actually different even though
>> they _look_ the same.
> Apparently, they use NFD (Normalization Form D). Normalization forms
> are necessary to avoid a whole lot of problems, although Ada requires
> normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
> implementation defined if the text is not in NFC.

That reference specifies NFKC, which I suppose is near! GNAT uses this if
either you compile with -gnatW8 or the file begins with a UTF-8 BOM.
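
As a rough illustration of how near NFC and NFKC are (a Python sketch
using the standard unicodedata module, nothing to do with GNAT's own
code): for the composed a-acute in this thread the two forms agree, and
they diverge only on compatibility characters. The fi ligature below is
an example chosen for illustration, not something from the thread.

```python
import unicodedata

a_acute = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE, already composed
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI (illustrative compatibility character)

# For a-acute, NFC and NFKC give the same result.
same_for_a_acute = (unicodedata.normalize("NFC", a_acute)
                    == unicodedata.normalize("NFKC", a_acute))

# NFC leaves the ligature alone; NFKC folds it to the two letters "fi".
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(same_for_a_acute)
print(nfc_ligature == ligature)
print(nfkc_ligature == "fi")
```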

The problems I've noted in this thread in the GNAT implementation are
two:

(1) On Windows and macOS (and possibly on VMS, not sure if that's
relevant any more) the file name corresponding to a unit name is
converted to lower case on the assumption that it's Latin-1, using
System.Case_Util.To_Lower:

   function To_Lower (A : Character) return Character is
      A_Val : constant Natural := Character'Pos (A);

   begin
      if A in 'A' .. 'Z'
        or else A_Val in 16#C0# .. 16#D6#
        or else A_Val in 16#D8# .. 16#DE#
      then
         return Character'Val (A_Val + 16#20#);
      else
         return A;
      end if;
   end To_Lower;

This is the problem that prevents use of extended characters in unit
names.
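
To make the failure concrete, here is a minimal Python sketch that
mirrors the To_Lower logic above and applies it byte by byte, first to a
Latin-1 name and then to a UTF-8 one. It assumes the conversion is
applied to each byte of the encoded name; the helper names
(to_lower_latin1, lower_bytes) are hypothetical and this is not GNAT's
code.

```python
def to_lower_latin1(b: int) -> int:
    """Same ranges as the Ada To_Lower above, applied to one byte value."""
    if 0x41 <= b <= 0x5A or 0xC0 <= b <= 0xD6 or 0xD8 <= b <= 0xDE:
        return b + 0x20
    return b


def lower_bytes(name: bytes) -> bytes:
    """Lower-case every byte as if the name were Latin-1."""
    return bytes(to_lower_latin1(b) for b in name)


# Latin-1 input: one byte per character, so the result is correct.
latin1_result = lower_bytes("P\u00c1CK3".encode("latin-1")).decode("latin-1")

# UTF-8 input: a-acute is the two bytes C3 A1, and the lead byte C3 falls
# in the 16#C0# .. 16#D6# range, so it is changed to E3.
utf8_result = lower_bytes("P\u00e1ck3".encode("utf-8"))

try:
    utf8_result.decode("utf-8")
    still_valid_utf8 = True
except UnicodeDecodeError:
    still_valid_utf8 = False

print(latin1_result)
print(utf8_result.hex(" "))
print(still_valid_utf8)
```

Note that even an already lower-case a-acute is damaged, because its
UTF-8 lead byte looks like an upper-case Latin-1 letter.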

(2) On macOS, the expected file name appears to be stored in NFC, but is
retrieved from the file system in NFD.

It seems this only causes a problem if you compile the file with
-gnatwe (which treats warnings as errors) on its own, not as part of the
closure of another file. That's weird; possibly the wildcard picks up
the NFD representation, while compiling as part of the closure uses the
NFC representation in the ALI?

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f p*.ads -gnatwe
gcc -c -gnatwe páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
gnatmake: "páck3.ads" compilation error

(this message was copied from Terminal and pasted into Emacs, which
makes clear the difference between the two representations; previously
I've copied from Terminal and pasted into Safari/Bugzilla, which
produced identical glyphs).
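
For anyone who wants to see the two representations without relying on
how an editor renders them, a small Python sketch (standard unicodedata
module; it only normalizes strings and does not reproduce what macOS or
GNAT do internally) shows that the NFC and NFD spellings of the file
name display alike but differ in length and bytes:

```python
import unicodedata

# Composed form: a-acute as the single code point U+00E1.
nfc_name = unicodedata.normalize("NFC", "p\u00e1ck3.ads")
# Decomposed form: 'a' followed by U+0301 COMBINING ACUTE ACCENT.
nfd_name = unicodedata.normalize("NFD", nfc_name)

print(nfc_name == nfd_name)                    # plain comparison sees two names
print(len(nfc_name), len(nfd_name))            # code point counts differ
print(nfc_name.encode("utf-8").hex(" "))
print(nfd_name.encode("utf-8").hex(" "))

# Normalizing both sides to the same form before comparing makes them equal.
equal_after_normalizing = unicodedata.normalize("NFC", nfd_name) == nfc_name
print(equal_after_normalizing)
```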