GNAT vs UTF-8 source file names

comp.lang.ada

* GNAT vs UTF-8 source file names
@ 2017-04-30 17:10 Simon Wright
  2017-06-17 17:20 ` Simon Wright
  0 siblings, 1 reply; 22+ messages in thread
From: Simon Wright @ 2017-04-30 17:10 UTC (permalink / raw)


ACATS 4.1 test C250002 involves unit names with UTF-8 characters (the
source has the correct UTF-8 BOM, the relevant unit is named C250002_Z
where Z is actually UTF-8 C381, latin capital letter a with acute;
gnatchop correctly generates a source file with the BOM and name
c250002_z where z is actually UTF-8 C3A1, latin small letter a with
acute).

On compiling, the compiler (GNAT GPL 2016, FSF GCC 7.0.1) fails to find
the file; it says e.g.

   GNATMAKE GPL 2016 (20160515-49)
   Copyright (C) 1992-2016, Free Software Foundation, Inc.
   gcc -c -I../../../support -gnatW8 c250002.adb
   gcc -c -I../../../support -gnatW8 c250002_0.ads
   End of compilation
   gnatmake: "c250002_?.adb" not found

I _suspect_ that the problem is down to the .ali file. macOS says

   $ file -I *
   c250002.adb:   text/plain; charset=utf-8
   c250002.ali:   text/plain; charset=unknown-8bit
   c250002.lst:   text/plain; charset=us-ascii
   c250002.o:     application/x-mach-binary; charset=binary
   c250002_0.ads: text/plain; charset=utf-8
   c250002_á.adb: text/plain; charset=utf-8
   c250002_á.ads: text/plain; charset=utf-8

(the last 2 were actually a-acute on the terminal) but the .ali file is
confused about whether the representation of the a-acute is C3A1 (good,
assuming it gets interpreted as UTF-8 without a BOM) or E3A1 (bad),
particularly about the corresponding .ali file name.

Any thoughts? Is this a known issue?

(C250001, which has BOMs and UTF-8 identifiers but not file names, works fine
with no -gnatW8 messing)



* Re: GNAT vs UTF-8 source file names
  2017-04-30 17:10 GNAT vs UTF-8 source file names Simon Wright
@ 2017-06-17 17:20 ` Simon Wright
  2017-06-27 13:22   ` Jacob Sparre Andersen
  2017-07-04 13:57   ` Simon Wright
  0 siblings, 2 replies; 22+ messages in thread
From: Simon Wright @ 2017-06-17 17:20 UTC (permalink / raw)

Simon Wright <simon@pushface.org> writes:

> ACATS 4.1 test C250002 involves unit names with UTF-8 characters (the
> source has the correct UTF-8 BOM, the relevant unit is named C250002_Z
> where Z is actually UTF-8 C381, latin capital letter a with acute;
> gnatchop correctly generates a source file with the BOM and name
> c250002_z where z is actually UTF-8 C3A1, latin small letter a with
> acute).
>
> On compiling, the compiler (GNAT GPL 2016, FSF GCC 7.0.1) fails to find
> the file; it says e.g.
>
>    GNATMAKE GPL 2016 (20160515-49)
>    Copyright (C) 1992-2016, Free Software Foundation, Inc.
>    gcc -c -I../../../support -gnatW8 c250002.adb
>    gcc -c -I../../../support -gnatW8 c250002_0.ads
>    End of compilation
>    gnatmake: "c250002_?.adb" not found

PR ada/81114 refers[1].

It turns out that this failure occurs on Windows and macOS. The problem
is that GNAT smashes the file name to lower case if it knows that the
file system is case-insensitive (using an ASCII to-lower, so of course
'smash' is the right word if there are UTF-8 characters in there).

There is an undocumented environment variable that affects this:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake c250002
   gcc -c c250002.adb
   gcc -c c250002_á.adb
   gnatbind -x c250002.ali
   gnatlink c250002.ali
   $ ./c250002

   ,.,. C250002 ACATS 4.1 17-06-17 18:05:55
   ---- C250002 Check that characters above ASCII.Del can be used in
                   identifiers, character literals and strings.
      - C250002 C250002_0.TAGGED_Ã _ID.
   ==== C250002 PASSED ============================.

I wonder why, if the FS is case-insensitive, GNAT bothers at all? (there
was, I think, some remark about detecting whether two filenames
represented different files).

What do people who actually need to use international character sets do
about this? Do you just avoid using international characters in Ada unit
names? Or have I just missed the relevant part of the manual?

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114


* Re: GNAT vs UTF-8 source file names
  2017-06-17 17:20 ` Simon Wright
@ 2017-06-27 13:22   ` Jacob Sparre Andersen
  2017-06-27 21:45     ` Niklas Holsti
  2017-07-04 13:57   ` Simon Wright
  1 sibling, 1 reply; 22+ messages in thread
From: Jacob Sparre Andersen @ 2017-06-27 13:22 UTC (permalink / raw)

Simon Wright wrote:

> What do people who actually need to use international character sets
> do about this? Do you just avoid using international characters in Ada
> unit names? Or have I just missed the relevant part of the manual?

One of my customers simply has a policy saying that all identifiers have
to be in English (the policy doesn't say if it should be American
English or proper English), and thus neatly works around the problem.

This reminds me that Jean-Pierre Rosen had a very entertaining tutorial
on glyphs, graphemes, alphabets, characters, character sets, encodings,
etc. at Ada-Europe 2017 in Vienna.  We learnt all kinds of stuff we
really don't want to know and worry about. ;-)

Greetings,

Jacob
-- 
"Even god needs a bus to get there."


* Re: GNAT vs UTF-8 source file names
  2017-06-27 13:22   ` Jacob Sparre Andersen
@ 2017-06-27 21:45     ` Niklas Holsti
  2017-06-28  5:05       ` G.B.
  0 siblings, 1 reply; 22+ messages in thread
From: Niklas Holsti @ 2017-06-27 21:45 UTC (permalink / raw)

On 17-06-27 16:22 , Jacob Sparre Andersen wrote:
 > Simon Wright wrote:
 >
 >> What do people who actually need to use international character sets
 >> do about this? Do you just avoid using international characters in
 >> Ada unit names? Or have I just missed the relevant part of the
 >> manual?

I use ISO-Latin-1 identifiers in some Ada programs written in a Finnish 
context, using the Finnish alphabet letters ä, ö, and sometimes the 
Swedish å. Worked OK for me until *some* of the file systems I use 
changed from file names with 8-bit characters to UTF-8 file names, after 
which CVS was quite messed up. I have since limited myself to ASCII in 
all identifiers that become file name parts in GNAT's file-naming 
convention, but I still use ISO Latin 1 for other identifiers.

 > One of my customers simply has a policy saying that all identifiers
 > have to be in English (the policy doesn't say if it should be American
 > English or proper English), and thus neatly works around the problem.

Only if you stick to "modern" English spelling. Otherwise you could 
have, for example,

    package Coördinates is ...

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .


* Re: GNAT vs UTF-8 source file names
  2017-06-27 21:45     ` Niklas Holsti
@ 2017-06-28  5:05       ` G.B.
  0 siblings, 0 replies; 22+ messages in thread
From: G.B. @ 2017-06-28  5:05 UTC (permalink / raw)

On 27.06.17 23:45, Niklas Holsti wrote:
>
>> One of my customers simply has a policy saying that all identifiers
>> have to be in English (the policy doesn't say if it should be American
>> English or proper English), and thus neatly works around the problem.
>
> Only if you stick to "modern" English spelling. Otherwise you could have, for example,
>
>    package Coördinates is ...

Just like some might be tempted to use floating point
types when they have permission to use integer types
instead: the support for the more complicated, error
prone, and difficult new floating point type is partially
broken, so, programmers, let us get away with the current
support situation by preferring integer types. They are much
more portable, anyway!


* Re: GNAT vs UTF-8 source file names
  2017-06-17 17:20 ` Simon Wright
  2017-06-27 13:22   ` Jacob Sparre Andersen
@ 2017-07-04 13:57   ` Simon Wright
  2017-07-04 17:30     ` Shark8
  2017-07-05  5:21     ` J-P. Rosen
  1 sibling, 2 replies; 22+ messages in thread
From: Simon Wright @ 2017-07-04 13:57 UTC (permalink / raw)

Simon Wright <simon@pushface.org> writes:

> PR ada/81114 refers[1].
>
> It turns out that this failure occurs on Windows and macOS. The problem
> is that GNAT smashes the file name to lower case if it knows that the
> file system is case-insensitive (using an ASCII to-lower, so of course
> 'smash' is the right word if there are UTF-8 characters in there).

> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114

It's worse than that, on macOS anyway[2].

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is[3] that macOS takes 
the composed form (lowercase a acute) and converts it under the hood 
to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
acute); thus the names are actually different even though they _look_ 
the same.
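The composed/decomposed difference described above can be reproduced outside GNAT. The following is a minimal Python sketch using the standard `unicodedata` module; the file name is taken from the warning message, and nothing here is GNAT or macOS code.

```python
import unicodedata

# The unit's file name with a-acute as a single code point (NFC) ...
composed = unicodedata.normalize("NFC", "p\u00e1ck3.ads")
# ... and as 'a' followed by U+0301 COMBINING ACUTE ACCENT (NFD).
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)                 # the strings differ
print(len(composed), len(decomposed))         # 9 versus 10 code points
print(composed.encode("utf-8").hex(" "))      # contains c3 a1
print(decomposed.encode("utf-8").hex(" "))    # contains 61 cc 81

# Normalizing both sides to one form makes them compare equal again.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)
```

A byte-for-byte comparison of the two names fails even though a terminal renders them identically, which would be consistent with the warning shown.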

I have to say that, great as it would be to have this fixed, the changes 
required would be extensive, and I can’t see that anyone would think it 
worth the trouble.

The recommendation would be "don’t use international characters in the 
names of library units".

[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114#c1
[3] https://stackoverflow.com/a/6153713/40851


* Re: GNAT vs UTF-8 source file names
  2017-07-04 13:57   ` Simon Wright
@ 2017-07-04 17:30     ` Shark8
  2017-07-04 18:08       ` Dennis Lee Bieber
  2017-07-05  5:25       ` J-P. Rosen
  2017-07-05  5:21     ` J-P. Rosen
  1 sibling, 2 replies; 22+ messages in thread
From: Shark8 @ 2017-07-04 17:30 UTC (permalink / raw)


On Tuesday, July 4, 2017 at 7:57:06 AM UTC-6, Simon Wright wrote:
> 
> It's worse than that, on macOS anyway[2].
> 
> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
> 
> The reason for this apparently-bizarre message is[3] that macOS takes 
> the composed form (lowercase a acute) and converts it under the hood 
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
> acute); thus the names are actually different even though they _look_ 
> the same.

This is why I maintain that unicode is crap -- a mistake along the lines of C that will likely take *decades* for the rest of "the industry" / computer science to realize.

> 
> I have to say that, great as it would be to have this fixed, the changes 
> required would be extensive, and I can’t see that anyone would think it 
> worth the trouble.

One of unicode's biggest problems is that there's no longer any coherent vision -- it started off as an idea to offer one code-point per character in human language, but then shifted to glyph-building (hence combining characters), and as such lacks a unifying principle.


* Re: GNAT vs UTF-8 source file names
  2017-07-04 17:30     ` Shark8
@ 2017-07-04 18:08       ` Dennis Lee Bieber
  2017-07-05  5:25       ` J-P. Rosen
  1 sibling, 0 replies; 22+ messages in thread
From: Dennis Lee Bieber @ 2017-07-04 18:08 UTC (permalink / raw)


On Tue, 4 Jul 2017 10:30:02 -0700 (PDT), Shark8 <onewingedshark@gmail.com>
declaimed the following:

>
>One of unicode's biggest problems is that there's no longer any coherent vision -- it started off as an idea to offer one code-point per character in human language, but then shifted to glyph-building (hence combining characters), and as such lacks a unifying principle.

	"glyph-building" though, does have a precedent: European typewriters
with "dead keys" (keys that made marks but did not advance the carriage
position)

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/


* Re: GNAT vs UTF-8 source file names
  2017-07-04 13:57   ` Simon Wright
  2017-07-04 17:30     ` Shark8
@ 2017-07-05  5:21     ` J-P. Rosen
  2017-07-05  9:47       ` Simon Wright
  1 sibling, 1 reply; 22+ messages in thread
From: J-P. Rosen @ 2017-07-05  5:21 UTC (permalink / raw)


Le 04/07/2017 à 15:57, Simon Wright a écrit :
> The reason for this apparently-bizarre message is[3] that macOS takes 
> the composed form (lowercase a acute) and converts it under the hood 
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
> acute); thus the names are actually different even though they _look_ 
> the same.
Apparently, they use NFD (Normalization Form D). Normalization forms are
necessary to avoid a whole lot of problems, although Ada requires
normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
implementation defined if the text is not in NFC.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: GNAT vs UTF-8 source file names
  2017-07-04 17:30     ` Shark8
  2017-07-04 18:08       ` Dennis Lee Bieber
@ 2017-07-05  5:25       ` J-P. Rosen
  2017-07-06 15:18         ` Shark8
  1 sibling, 1 reply; 22+ messages in thread
From: J-P. Rosen @ 2017-07-05  5:25 UTC (permalink / raw)


Le 04/07/2017 à 19:30, Shark8 a écrit :
> This is why I maintain that unicode is crap -- a mistake along the
> lines of C that will likely take *decades* for the rest of "the
> industry" / computer science to realize.
Please don't make such statements until you understand all the issues -
the problem of character sets is incredibly complicated.

>> I have to say that, great as it would be to have this fixed, the
>> changes required would be extensive, and I can’t see that anyone
>> would think it worth the trouble.
> One of unicode's biggest problems is that there's no longer any
> coherent vision -- it started off as an idea to offer one code-point
> per character in human language, but then shifted to glyph-building
> (hence combining characters), and as such lacks a unifying
> principle.
The unifying principle is the normalization forms. The fact that there
are several normalization forms comes from the difference between human
and computer needs.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: GNAT vs UTF-8 source file names
  2017-07-05  5:21     ` J-P. Rosen
@ 2017-07-05  9:47       ` Simon Wright
  2017-07-05 11:20         ` J-P. Rosen
  0 siblings, 1 reply; 22+ messages in thread
From: Simon Wright @ 2017-07-05  9:47 UTC (permalink / raw)


"J-P. Rosen" <rosen@adalog.fr> writes:

> Le 04/07/2017 à 15:57, Simon Wright a écrit :
>> The reason for this apparently-bizarre message is[3] that macOS takes
>> the composed form (lowercase a acute) and converts it under the hood
>> to what HFS+ insists on, the fully decomposed form (lowercase a,
>> combining acute); thus the names are actually different even though
>> they _look_ the same.
> Apparently, they use NFD (Normalization Form D). Normalization forms
> are necessary to avoid a whole lot of problems, although Ada requires
> normalization form C (ARM 2.1 (4.1/3)), or more precisely, it is
> implementation defined if the text is not in NFC.

That reference specifies NFKC which I suppose is near! GNAT uses this if
either you compile with -gnatW8 or the file begins with a UTF8 BOM.

The problems I've noted in this thread in the GNAT implementation are
two:

(1) On Windows and macOS (and possibly on VMS, not sure if that's
relevant any more) the file name corresponding to a unit name is
converted to lower-case assuming it's Latin-1 -
System.Case_Util.To_Lower,

   function To_Lower (A : Character) return Character is
      A_Val : constant Natural := Character'Pos (A);

   begin
      if A in 'A' .. 'Z'
        or else A_Val in 16#C0# .. 16#D6#
        or else A_Val in 16#D8# .. 16#DE#
      then
         return Character'Val (A_Val + 16#20#);
      else
         return A;
      end if;
   end To_Lower;

This is the problem that prevents use of extended characters in unit
names.
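To see what that rule does to a UTF-8 name, here is a small Python sketch that applies the same byte-wise test to the UTF-8 bytes of the file name from the first message. It is an illustration only, not GNAT code; the helper name `latin1_to_lower` is made up for this example.

```python
def latin1_to_lower(byte: int) -> int:
    """Byte-wise equivalent of the To_Lower logic quoted above."""
    if (ord("A") <= byte <= ord("Z")
            or 0xC0 <= byte <= 0xD6
            or 0xD8 <= byte <= 0xDE):
        return byte + 0x20
    return byte


name = "c250002_\u00e1.adb"        # a-acute as one code point
utf8 = name.encode("utf-8")        # a-acute becomes the bytes C3 A1

# Treat every UTF-8 byte as if it were a Latin-1 character.
smashed = bytes(latin1_to_lower(b) for b in utf8)

print(utf8.hex(" "))
print(smashed.hex(" "))
# The result is no longer valid UTF-8, so decoding needs a fallback.
print(smashed.decode("utf-8", errors="replace"))
```

The lead byte C3 falls in the 16#C0# .. 16#D6# range and becomes E3, so the pair C3 A1 turns into E3 A1. That matches the "E3A1 (bad)" sequence reported for the .ali file earlier in the thread, and would plausibly explain the "c250002_?.adb" in the gnatmake message.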

(2) On macOS, the expected file name appears to be stored in NFC, but is
retrieved from the file system in NFD.

It seems this will only cause a problem if you compile the file (on its
own, not as part of the closure of another file - weird - possibly
because the wildcard picks up the NFD representation, while compiling as
part of the closure uses the NFC representation in the ALI?) with -gnatwe:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f p*.ads -gnatwe
gcc -c -gnatwe páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"
gnatmake: "páck3.ads" compilation error

(this message was copied from Terminal and pasted into Emacs, which
makes clear the difference between the two representations; previously
I've copied from Terminal and pasted into Safari/Bugzilla, which
produced identical glyphs).


* Re: GNAT vs UTF-8 source file names
  2017-07-05  9:47       ` Simon Wright
@ 2017-07-05 11:20         ` J-P. Rosen
  2017-07-05 18:42           ` Randy Brukardt
  2017-07-06 18:43           ` Simon Wright
  0 siblings, 2 replies; 22+ messages in thread
From: J-P. Rosen @ 2017-07-05 11:20 UTC (permalink / raw)


Le 05/07/2017 à 11:47, Simon Wright a écrit :
> That reference specifies NFKC which I suppose is near! 
Not that near when it comes to ligatures and other crazy characters...
But you are right, it's NFKC.
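The ligature case is easy to check. The Python sketch below (standard `unicodedata` module, with a sample string chosen for this illustration) compares NFC and NFKC on a word starting with U+FB01 LATIN SMALL LIGATURE FI.

```python
import unicodedata

ligature_word = "\ufb01le"   # U+FB01 (fi ligature) followed by "le"

nfc = unicodedata.normalize("NFC", ligature_word)
nfkc = unicodedata.normalize("NFKC", ligature_word)

print(nfc == ligature_word)   # NFC leaves the ligature alone
print(nfkc)                   # NFKC replaces it with plain "f" + "i"
print(len(nfc), len(nfkc))    # 3 versus 4 code points
```

NFC only composes canonically equivalent sequences, while NFKC also folds compatibility characters such as ligatures, so the two forms can yield different identifiers.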

> GNAT uses this if
> either you compile with -gnatW8 or the file begins with a UTF8 BOM.
Actually, this has nothing to do with encoding or coded character sets.
Even if you use Latin-1, the set of allowed characters is defined as
those that belong to NFKC.

> The problems I've noted in this thread in the GNAT implementation are
> two:
> 
> (1) On Windows and macOS (and possibly on VMS, not sure if that's
> relevant any more) the file name corresponding to a unit name is
> converted to lower-case assuming it's Latin-1 -
> System.Case_Util.To_Lower,
I can talk about character issues since I gave that tutorial at AE'17...
How operating systems manage that, I don't know.

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr



* Re: GNAT vs UTF-8 source file names
  2017-07-05 11:20         ` J-P. Rosen
@ 2017-07-05 18:42           ` Randy Brukardt
  2017-07-06 18:43           ` Simon Wright
  1 sibling, 0 replies; 22+ messages in thread
From: Randy Brukardt @ 2017-07-05 18:42 UTC (permalink / raw)


"J-P. Rosen" <rosen@adalog.fr> wrote in message 
news:ojihrl$qu2$1@dont-email.me...
> Le 05/07/2017 à 11:47, Simon Wright a écrit :
>> That reference specifies NFKC which I suppose is near!
> Not that near when it comes to ligatures and other crazy characters...
> But you are right, it's NFKC.

Actually, you were right the first time, but it doesn't show up in the Ada 
2012 Standard, as this is a recent correction (recall AI12-0004-1? It was just 
approved by WG 9 at the June meeting). NFKC is *definitely* the wrong rule.

Note that we chose NFC in part because W3C recommends that all Internet 
content be in NFC, and because it is the more compact representation. I'm 
surprised that anyone would use NFD (since it can be three times larger than 
NFC), but I suppose I shouldn't ever be surprised by the choices of others. 
;-)
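One case where the size ratio mentioned above reaches three is a precomposed Hangul syllable. The Python sketch below is an illustration with a sample character picked for this purpose (U+D55C); it is not taken from the thread.

```python
import unicodedata

syllable = "\ud55c"   # HANGUL SYLLABLE HAN, one precomposed code point

nfc = unicodedata.normalize("NFC", syllable)
nfd = unicodedata.normalize("NFD", syllable)   # three conjoining jamo

print(len(nfc), len(nfd))                      # code points: 1 versus 3
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))
print(nfc_bytes, nfd_bytes)                    # UTF-8 bytes: 3 versus 9
```

For Latin letters with a single accent the growth is smaller: the a-acute discussed in this thread goes from two UTF-8 bytes in NFC to three in NFD.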

As always, you can see the *current* state of Ada by using the working draft 
RM (see http://www.ada-auth.org/standards/ada2x.html). For this rule, that 
is 2.1(4.1/5).

I suppose the working draft is a bit confusing for this use (that is, 
Ada-Comment) as corrections (like this) take effect immediately upon WG 9 
approval while amendments don't take effect until the next Standard update. 
You can tell them apart by looking at the bottom of each subclause at the 
"<something> from Ada 2012" (for instance, "Wording Changes from Ada 
2012") -- "corrections" are identified that way, while amendments are not 
identified specially.

                               Randy. 



* Re: GNAT vs UTF-8 source file names
  2017-07-05  5:25       ` J-P. Rosen
@ 2017-07-06 15:18         ` Shark8
  2017-07-07  8:19           ` J-P. Rosen
  0 siblings, 1 reply; 22+ messages in thread
From: Shark8 @ 2017-07-06 15:18 UTC (permalink / raw)

On Tuesday, July 4, 2017 at 11:25:20 PM UTC-6, J-P. Rosen wrote:
> Le 04/07/2017 à 19:30, Shark8 a écrit :
> > This is why I maintain that unicode is crap -- a mistake along the
> > lines of C that will likely take *decades* for the rest of "the
> > industry" / computer science to realize.
> Please don't make such statements until you understand all the issues -
> the problem of character sets is incredibly complicated.

I'm not saying it isn't complicated; I'm saying that it could, and should, have been done better. Instead we get a bizarre Frankenstein's-monster of techniques where some character-glyphs are precomposed (with duplicates across multiple languages) and Zalgo-script is a thing. (see: https://eeemo.net/ )

Not only that, but there's the problem of strings. The sensible ("but wasteful"*) approach would have been to design a "multilanguage string" that partitioned strings by language. Ex:

Type Language is (English, French, Russian); -- supported languages

Type Discriminated_String( Words : Language; Length : Natural ) is record
  Data : String(1..Length); -- Sequence of code-points/characters.
end record;

Package Discriminated_String_Vector is new Ada.Containers.Indefinite_Vectors
  ( Index_Type => Positive, Element_Type => Discriminated_String );

Type Multi_Language_String is new Discriminated_String_Vector.Vector with null record;
-- New primitive operations.

And *THERE* you have a sane framework for managing multilingual text; granted *most* text would only /need/ a single element vector because most text is not multi-lingual; that's ok. The important part here is that the languages are kept distinct and clearly indicated. (This would also allow far more maintainability than unicode's system because you could then allow independent subgroups to manage their own language.)

> 
> >> I have to say that, great as it would be to have this fixed, the
> >> changes required would be extensive, and I can’t see that anyone
> >> would think it worth the trouble.
> > One of unicode's biggest problems is that there's no longer any
> > coherent vision -- it started off as an idea to offer one code-point
> > per character in human language, but then shifted to glyph-building
> > (hence combining characters), and as such lacks a unifying
> > principle.
> The unifying principle is the normalization forms. The fact that there
> are several normalization forms comes from the difference between human
> and computer needs.

Perhaps so, but there ought to be a way to identify such a context rather than just throwing these normalized forms in the UTF-string blender, shrugging, and handing it off to the programmers as "not my problem".

I mean as a counter-example ASN.1 has normalizing encodings like DER and CER, but these are (a) usually distinguished by being defined by their particular encoding, and when they aren't (b) are proper subsets of BER. [Much like subtypes in Ada and how we can use Natural & Positive for better describing our problem, but can use Integer when needed (i.e. foreign interfacing where the constraint might not be guaranteed).]

* -- Wasteful like keeping the bounds of an array seems wasteful to C programmers.


* Re: GNAT vs UTF-8 source file names
  2017-07-05 11:20         ` J-P. Rosen
  2017-07-05 18:42           ` Randy Brukardt
@ 2017-07-06 18:43           ` Simon Wright
  2017-07-07  8:26             ` J-P. Rosen
  1 sibling, 1 reply; 22+ messages in thread
From: Simon Wright @ 2017-07-06 18:43 UTC (permalink / raw)

"J-P. Rosen" <rosen@adalog.fr> writes:

>> GNAT uses this if
>> either you compile with -gnatW8 or the file begins with a UTF8 BOM.
> Actually, this has nothing to do with encoding or coded character sets.
> Even if you use Latin-1, the set of allowed characters is defined as
> those that belong to NFKC.

I don't understand.

If your source has no BOM and you don't say -gnatW8, GNAT expects
Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
expects UTF8 encoding (I haven't tried what happens if you use NFD).
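The selection rule described in that paragraph can be written down as a short Python sketch. This only models the behaviour as reported in this thread and is not GNAT's implementation; the function name `assumed_source_encoding` and the `w8_switch` parameter are hypothetical.

```python
import codecs


def assumed_source_encoding(data: bytes, w8_switch: bool = False) -> str:
    """Model of the reported rule: UTF-8 if a BOM or -gnatW8, else Latin-1."""
    if w8_switch or data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return "latin-1"


source_text = "package P\u00e1ck3 is end P\u00e1ck3;"
with_bom = codecs.BOM_UTF8 + source_text.encode("utf-8")
without_bom = source_text.encode("utf-8")

print(assumed_source_encoding(with_bom))                      # utf-8
print(assumed_source_encoding(without_bom))                   # latin-1
print(assumed_source_encoding(without_bom, w8_switch=True))   # utf-8

# Reading BOM-less UTF-8 as Latin-1 turns each a-acute into two characters.
misread = without_bom.decode("latin-1")
print(len(source_text), len(misread))
```

The UTF-8 BOM is the three bytes EF BB BF. Under this model, a BOM-less UTF-8 file compiled without the switch has every non-ASCII character read as two or more Latin-1 characters.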

I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
use in unit names - ARM 2.1(16) says it should be accepted.

(later) UTF8 is accepted in strings but not in identifiers.


* Re: GNAT vs UTF-8 source file names
  2017-07-06 15:18         ` Shark8
@ 2017-07-07  8:19           ` J-P. Rosen
  0 siblings, 0 replies; 22+ messages in thread
From: J-P. Rosen @ 2017-07-07  8:19 UTC (permalink / raw)


Le 06/07/2017 à 17:18, Shark8 a écrit :
> I'm not saying it isn't complicated; I'm saying that it could, and
> should, have been done better.
I'm willing to accept these kinds of statement only from people who
participated in the design...

> Instead we get a bizarre
> Frankenstein's-monster of techniques where some character-glyphs are
> precomposed (with duplicates across multiple languages) and
> Zalgo-script is a thing. (see: https://eeemo.net/ )
Yes, representation of characters is not unique. It's a compromise
between compactness, compatibility, exhaustiveness...

> Not only that, but there's the problem of strings; instead of doing
> something sensible ("but wasteful"*) by designing a "multilanguage
> string" that partitioned strings by language. Ex:
This is total confusion. Unicode is about coded sets and encodings, it
has nothing to do with languages and internationalization.

>> The unifying principle is the normalization forms. The fact that
>> there are several normalization forms comes from the difference
>> between human and computer needs.
> 
> Perhaps so, but there ought to be a way to identify such a context
> rather than just throwing these normalized forms in the UTF-string
> blender, shrugging, and handing it off to the programmers as "not my
> problem".
Another confusion: normalization forms have nothing to do with encodings
(UTF or not). Normalization provides a unique representation of
composite characters that may be represented in several ways.

> I mean as a counter-example ASN.1 has normalizing encodings like DER
> and CER, but these are (a) usually distinguished by being defined by
> their particular encoding, and when they aren't (b) are proper
> subsets of BER. [Much like subtypes in Ada and how we can use Natural
> & Positive for better describing our problem, but can use Integer
> when needed (i.e. foreign interfacing where the constraint might not be
> guaranteed).]
I don't follow you here. ASN.1 is a representation of structured data,
and AFAIU does not specify which coded set is used.


-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: GNAT vs UTF-8 source file names
  2017-07-06 18:43           ` Simon Wright
@ 2017-07-07  8:26             ` J-P. Rosen
  2017-07-07 11:01               ` Simon Wright
  0 siblings, 1 reply; 22+ messages in thread
From: J-P. Rosen @ 2017-07-07  8:26 UTC (permalink / raw)


Le 06/07/2017 à 20:43, Simon Wright a écrit :
>> Even if you use Latin-1, the set of allowed characters is defined as
>> those that belong to NFKC.
> I don't understand.
> 
> If your source has no BOM and you don't say -gnatW8, GNAT expects
> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
> expects UTF8 encoding (I haven't tried what happens if you use NFD).
> 
> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
> use in unit names - ARM 2.1(16) says it should be accepted.
> 
> (later) UTF8 is accepted in strings but not in identifiers.

This is a common confusion between characters, coded sets, and encodings...

ISO-10646 defines a coded set (code points) for a number of characters
(identical to the one defined by Unicode). Some of these characters can
be represented in NFKC. These are the allowed characters.

If you use Latin-1, you have different code points for the same
characters - and the allowed characters are still those representable in
NFKC, even with different code points.

UTF8 is an encoding, nothing more than a compression algorithm for
numerical values. It is generally used to compress Unicode strings, but
could be used for any numerical values. In any case, it doesn't change
logical values, just the way they are stored.
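The distinction between a character and its stored bytes can be illustrated with a few lines of Python. This is a sketch using the a-acute from this thread as the sample character.

```python
ch = "\u00e1"                      # LATIN SMALL LETTER A WITH ACUTE

print(f"U+{ord(ch):04X}")          # the logical value (code point)

latin1_bytes = ch.encode("latin-1")
utf8_bytes = ch.encode("utf-8")
print(latin1_bytes.hex(" "))       # one byte:  e1
print(utf8_bytes.hex(" "))         # two bytes: c3 a1

# Decoding each byte sequence with its own encoding gives the same character.
round_trip_ok = latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == ch
print(round_trip_ok)
```

The stored bytes differ between the two encodings, but the character recovered from them is the same, so rules about which characters are allowed apply regardless of how the source file is encoded.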


-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr


* Re: GNAT vs UTF-8 source file names
  2017-07-07  8:26             ` J-P. Rosen
@ 2017-07-07 11:01               ` Simon Wright
  2017-07-07 11:49                 ` Jacob Sparre Andersen
  2017-07-07 19:40                 ` Randy Brukardt
  0 siblings, 2 replies; 22+ messages in thread
From: Simon Wright @ 2017-07-07 11:01 UTC (permalink / raw)


"J-P. Rosen" <rosen@adalog.fr> writes:

> Le 06/07/2017 à 20:43, Simon Wright a écrit :
>>> Even if you use Latin-1, the set of allowed characters is defined as
>>> those that belong to NFKC.
>> I don't understand.
>> 
>> If your source has no BOM and you don't say -gnatW8, GNAT expects
>> Latin-1 encoding. If your source has a BOM or you say -gnatW8, GNAT
>> expects UTF8 encoding (I haven't tried what happens if you use NFD).
>> 
>> I haven't tried giving UTF8 coding without BOM or -gnatW8 - ignoring the
>> use in unit names - ARM 2.1(16) says it should be accepted.
>> 
>> (later) UTF8 is accepted in strings but not in identifiers.
>
> This is a common confusion between characters, coded sets, and encodings...
>
> ISO-10646 defines a coded set (code points) for a number of characters
> (identical to the one defined by Unicode). Some of these characters can
> be represented in NFKC. These are the allowed characters.
>
> If you use Latin-1, you have different code points for the same
> characters - and the allowed characters are still those representable in
> NFKC, even with different code points.
>
> UTF8 is an encoding, nothing more than a compression algorithm for
> numerical values. It is generally used to compress Unicode strings, but
> could be used for any numerical values. In any case, it doesn't change
> logical values, just the way they are stored.

I think this is a response to my "I don't understand" - I think I do
understand a little better now, thank you.

The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says

   "An Ada implementation shall accept Ada source code in UTF-8
   encoding, with or without a BOM (see A.4.11), where every character
   is represented by its code point."

which for GNAT is not met unless either there is a BOM or -gnatW8 is
used.

On the other hand, ARM 2.1(4/3) says "The coded representation for
characters is implementation defined", which seems to conflict with (16)
- but then, the AARM ramification (4.b/2) notes that the rule doesn't
have much force!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: GNAT vs UTF-8 source file names
  2017-07-07 11:01               ` Simon Wright
@ 2017-07-07 11:49                 ` Jacob Sparre Andersen
  2017-07-07 19:44                   ` Randy Brukardt
  2017-07-07 19:40                 ` Randy Brukardt
  1 sibling, 1 reply; 22+ messages in thread
From: Jacob Sparre Andersen @ 2017-07-07 11:49 UTC (permalink / raw)


Simon Wright wrote:

> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>
>    "An Ada implementation shall accept Ada source code in UTF-8
>    encoding, with or without a BOM (see A.4.11), where every character
>    is represented by its code point."
>
> which for GNAT is not met unless either there is a BOM or -gnatW8 is
> used.

Which sounds perfectly okay.

There are no limitations to which command-line arguments a program can
require to behave like an Ada compiler.

> On the other hand, ARM 2.1(4/3) says "The coded representation for
> characters is implementation defined", which seems to conflict with
> (16) - but then, the AARM ramification (4.b/2) notes that the rule
> doesn't have much force!

That sounds like the classical wording.

I suppose that the intent is that UTF-8 encoded ISO-10646 (in the right
normalization form) _has_ to be supported, but that any other encoding
is allowed in addition to that.

It would of course be nice if that was also what the ARM actually said.

Greetings,

Jacob
-- 
"Only Hogwarts students really need spellcheckers"
                                -- An anonymous RISKS reader


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: GNAT vs UTF-8 source file names
  2017-07-07 11:01               ` Simon Wright
  2017-07-07 11:49                 ` Jacob Sparre Andersen
@ 2017-07-07 19:40                 ` Randy Brukardt
  2017-07-07 21:02                   ` Simon Wright
  1 sibling, 1 reply; 22+ messages in thread
From: Randy Brukardt @ 2017-07-07 19:40 UTC (permalink / raw)


"Simon Wright" <simon@pushface.org> wrote in message 
news:lybmow1pfk.fsf@pushface.org...
...
> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>
>   "An Ada implementation shall accept Ada source code in UTF-8
>   encoding, with or without a BOM (see A.4.11), where every character
>   is represented by its code point."
>
> which for GNAT is not met unless either there is a BOM or -gnatW8 is
> used.

The Standard says "shall accept"; it has nothing to say about what 
handstands are needed to get the required behavior. If GNAT required you to 
chant "Ada is Great" toward New York and then Paris before accepting UTF-8 
source, it would still meet the requirement of the Standard. Certainly 
requiring the use of -gnatW8 to get the language-required behavior is 
acceptable (recall that you have to use -gnatE and used to have to 
use -gnato to get the language-required behavior in other areas).

> On the other hand, ARM 2.1(4/3) says "The coded representation for
> characters is implementation defined", which seems to conflict with (16)
> - but then, the AARM ramification (4.b/2) notes that the rule doesn't
> have much force!

An implementation can have other encodings (which are 
implementation-defined). The new rule (2.1(16/3)) mainly just reflects that 
practically, an Ada compiler has to be able to accept the source of the 
ACATS; we decided to require that in the Standard so that there is a 
standard source form that every compiler is going to support. Thus it is now 
possible to portably write Ada source code as well as write a portable Ada 
program. (Practically, this was always true, but it's better to have it 
written in the Standard.)

                                        Randy.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: GNAT vs UTF-8 source file names
  2017-07-07 11:49                 ` Jacob Sparre Andersen
@ 2017-07-07 19:44                   ` Randy Brukardt
  0 siblings, 0 replies; 22+ messages in thread
From: Randy Brukardt @ 2017-07-07 19:44 UTC (permalink / raw)


"Jacob Sparre Andersen" <jacob@jacob-sparre.dk> wrote in message 
news:87inj4xy8q.fsf@jacob-sparre.dk...
...
>> On the other hand, ARM 2.1(4/3) says "The coded representation for
>> characters is implementation defined", which seems to conflict with
>> (16) - but then, the AARM ramification (4.b/2) notes that the rule
>> doesn't have much force!
>
> That sounds like the classical wording.
>
> I suppose that the intent is that UTF-8 encoded ISO-10646 (in the right
> normalization form) _has_ to be supported, but that any other encoding
> is allowed in addition to that.

Precisely.

> It would of course be nice if that was also what the ARM actually said.

Mostly we're not changing text that doesn't have to be changed. In some 
cases, it would make more sense if it was changed, but since every change 
has a potential for errors and unintended consequences, it's often best to 
leave stuff alone. (There are many cases where a "simple" change broke 
something else, leading to repeated fixes.)

                     Randy.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: GNAT vs UTF-8 source file names
  2017-07-07 19:40                 ` Randy Brukardt
@ 2017-07-07 21:02                   ` Simon Wright
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Wright @ 2017-07-07 21:02 UTC (permalink / raw)


"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Simon Wright" <simon@pushface.org> wrote in message 
> news:lybmow1pfk.fsf@pushface.org...
> ...
>> The rest is about GNAT's behaviour; to reiterate, ARM 2.1(16/3) says
>>
>>   "An Ada implementation shall accept Ada source code in UTF-8
>>   encoding, with or without a BOM (see A.4.11), where every character
>>   is represented by its code point."
>>
>> which for GNAT is not met unless either there is a BOM or -gnatW8 is
>> used.
>
> The Standard says "shall accept"; it has nothing to say about what 
> handstands are needed to get the required behavior

I suppose I'm more used to military requirements, where (IMO) handstands
would be unacceptable, and "shall accept" means just that. Perhaps
"shall be able to accept"? But (having read your other note) I see why
this isn't going to change.

^ permalink raw reply	[flat|nested] 22+ messages in thread

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
2017-04-30 17:10 GNAT vs UTF-8 source file names Simon Wright
2017-06-17 17:20 ` Simon Wright
2017-06-27 13:22   ` Jacob Sparre Andersen
2017-06-27 21:45     ` Niklas Holsti
2017-06-28  5:05       ` G.B.
2017-07-04 13:57   ` Simon Wright
2017-07-04 17:30     ` Shark8
2017-07-04 18:08       ` Dennis Lee Bieber
2017-07-05  5:25       ` J-P. Rosen
2017-07-06 15:18         ` Shark8
2017-07-07  8:19           ` J-P. Rosen
2017-07-05  5:21     ` J-P. Rosen
2017-07-05  9:47       ` Simon Wright
2017-07-05 11:20         ` J-P. Rosen
2017-07-05 18:42           ` Randy Brukardt
2017-07-06 18:43           ` Simon Wright
2017-07-07  8:26             ` J-P. Rosen
2017-07-07 11:01               ` Simon Wright
2017-07-07 11:49                 ` Jacob Sparre Andersen
2017-07-07 19:44                   ` Randy Brukardt
2017-07-07 19:40                 ` Randy Brukardt
2017-07-07 21:02                   ` Simon Wright
