From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!feeder.erje.net!eu.feeder.erje.net!newsreader4.netcologne.de!news.netcologne.de!newsfeed.arcor.de!newsspool3.arcor-online.net!news.arcor.de.POSTED!not-for-mail
Date: Fri, 22 Nov 2013 12:54:31 +0100
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:24.0) Gecko/20100101 Thunderbird/24.1.0
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
References: <73e0853b-454a-467f-9dc7-84ca5b9c29b2@googlegroups.com>
 <1ghx537y5gbfq.17oazom68d4n6.dlg@40tude.net>
 <5bf1b290-70bc-4240-b27c-120ce6b0b840@googlegroups.com>
 <z2fwn0g0hlr3$.1bktkfuljfy6b.dlg@40tude.net>
 <7464679c-6b98-4e23-a337-83b671473553@googlegroups.com>
 <l6mah4$3c1$1@loke.gir.dk>
 <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>
In-Reply-To: <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Message-ID: <528f45f8$0$6557$9b4e6d93@newsspool4.arcor-online.net>
Organization: Arcor
NNTP-Posting-Date: 22 Nov 2013 12:54:32 CET
NNTP-Posting-Host: 69297ccc.newsspool4.arcor-online.net
X-Trace: 
 DXC=4f4QdHL@NC3^8FBo0_81f>4IUK<Cl32<14Fo<]lROoR18kF<OcfhCO;hn@@Y2U:<l>PCY\c7>ejV8RkEeDe5Vfh9jL]jihSIeR<
X-Complaints-To: usenet-abuse@arcor.de
Xref: news.eternal-september.org comp.lang.ada:17770
Date: 2013-11-22T12:54:32+01:00
List-Id: <comp.lang.ada>

On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression
functions will be valuable. I wonder why these have not
yet been defined?

For example, an implementation such as Janus/Ada reads
string literals as Latin-1, so static expression functions
could then test properties of the literal. (Length checks
are another useful, though less reliable, option.)

Then, when read as Latin-1, the literal String_3'("§ 1")
in the Subject parameter of

    Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at
compile time, in a static predicate, say.


package UTF_8_Checks is

    pragma Pure (UTF_8_Checks);

    --  (Not working statically, in current Ada.)

    --  If:
    --    - static functions include expression functions of only
    --      static expressions,
    --
    --  then function Is_UTF_8 below can test a string literal
    --  at compile time.

    U0 : constant := 0;
    U1 : constant := 2#1000_0000#;
    U2 : constant := 2#1100_0000#;
    U3 : constant := 2#1110_0000#;
    U4 : constant := 2#1111_0000#;
    U5 : constant := 2#1111_1000#;
    UX : constant := 255;

    subtype XString is String (1 .. 12)
       with Static_Predicate => XString'Last < Positive'Last;
    --  for string_literals of a static string subtype

    type XInteger is range 0 .. 255;

    function Is_UTF_8_Follow (C : Character) return Boolean is
       --  an octet that has its most significant bit set, but
       --  not the next one:
       (Character'Pos (C) in U1 .. U2 - 1);

    function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
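       --  every sequence of characters from Subject, starting at First,
       --  has the form of a UTF-8 sequence, assuming sequences of at
       --  most four octets (code points up to 16#10_FFFF#). Only the
       --  lead/follow structure is checked; overlong forms and
       --  surrogates are not rejected.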
      (if First > Subject'Last then True
       else
         (case XInteger (Character'Pos (Subject (First))) is
             when 0 .. U1 - 1 =>
                --  "ASCII 7 bit"
               Is_UTF_8 (First + 1, Subject),

             when U1 .. U2 - 1 =>
                --  a follow octet (see Is_UTF_8_Follow) cannot start a sequence
                False,

             when U2 .. U3 - 1 =>
               (if First > Subject'Last - 1 then False
                else
                  (for all j in 1 .. 1 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 2, Subject)),

             when U3 .. U4 - 1 =>
               (if First > Subject'Last - 2 then False
                else
                  (for all j in 1 .. 2 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 3, Subject)),

             when U4 .. U5 - 1 =>
               (if First > Subject'Last - 3 then False
                else
                  (for all j in 1 .. 3 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 4, Subject)),

             when U5 .. UX =>
                False));

end UTF_8_Checks;
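
Until something like the above can be evaluated statically, the same
structural test can at least be run at elaboration or run time. What
follows is only a minimal sketch under that assumption, not the package
above: the names UTF_8_Demo, Octet, Is_Follow, Section_UTF_8 and
Section_Latin_1 are made up for illustration, the test strings are built
from Character'Val so that the result does not depend on how the
compiler reads the source file, and, like Is_UTF_8 above, it checks only
the lead/follow structure (no rejection of overlong forms or surrogates).

```ada
with Ada.Text_IO;

procedure UTF_8_Demo is

   type Octet is range 0 .. 255;

   --  A follow octet has the bit pattern 2#10xx_xxxx#.
   function Is_Follow (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Hypothetical run-time counterpart of the structural check:
   --  the lead octet decides how many follow octets must come next.
   --  (Assumes Subject'Last is well below Positive'Last.)
   function Is_UTF_8 (Subject : String) return Boolean is
      I : Natural := Subject'First;
      N : Natural;   --  number of follow octets expected
   begin
      while I <= Subject'Last loop
         case Octet (Character'Pos (Subject (I))) is
            when 16#00# .. 16#7F# => N := 0;
            when 16#C0# .. 16#DF# => N := 1;
            when 16#E0# .. 16#EF# => N := 2;
            when 16#F0# .. 16#F7# => N := 3;
            when others           => return False;
         end case;

         --  Not enough octets left for the announced sequence.
         if Subject'Last - I < N then
            return False;
         end if;

         for J in 1 .. N loop
            if not Is_Follow (Subject (I + J)) then
               return False;
            end if;
         end loop;

         I := I + N + 1;
      end loop;
      return True;
   end Is_UTF_8;

   --  "§ 1" as UTF-8 octets (16#C2# 16#A7#) and as a single
   --  Latin-1 octet (16#A7#), respectively.
   Section_UTF_8   : constant String :=
     Character'Val (16#C2#) & Character'Val (16#A7#) & " 1";
   Section_Latin_1 : constant String :=
     Character'Val (16#A7#) & " 1";

begin
   Ada.Text_IO.Put_Line (Boolean'Image (Is_UTF_8 (Section_UTF_8)));
   Ada.Text_IO.Put_Line (Boolean'Image (Is_UTF_8 (Section_Latin_1)));
end UTF_8_Demo;
```

The first line printed should be TRUE and the second FALSE, since a lone
16#A7# is a follow octet without a lead octet. The point of static
expression functions would be to get the same answer from the compiler
instead of from a test run.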