comp.lang.ada
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
Subject: Re: strange behaviour of utf-8 files
Date: Fri, 22 Nov 2013 12:54:31 +0100
Message-ID: <528f45f8$0$6557$9b4e6d93@newsspool4.arcor-online.net>
In-Reply-To: <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>

On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression
functions would be valuable. I wonder why these have not
yet been defined?

For example, an implementation such as Janus/Ada reads
string literals as Latin-1, so static expression functions
could test properties of the literal at compile time.
(Length checks are another useful, though less reliable,
option.)
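
To make the length check concrete, here is a minimal sketch. It is
built by hand and does not depend on how any compiler decodes its
source: the two constants spell out the octets a Latin-1 reader would
see for "§ 1" when the file was saved as Latin-1 and when it was saved
as UTF-8. Length_Demo, From_Latin_1, From_UTF_8 and Checked are names
made up for this illustration, and String_3 is assumed to be
String (1 .. 3).

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Length_Demo is

   --  "§ 1" as seen when the source file really is Latin-1:
   --  a single octet, 16#A7#, for the section sign.
   From_Latin_1 : constant String :=
     Character'Val (16#A7#) & " 1";

   --  The same text saved as UTF-8 but read as Latin-1:
   --  two octets, 16#C2# 16#A7#, for the section sign.
   From_UTF_8 : constant String :=
     Character'Val (16#C2#) & Character'Val (16#A7#) & " 1";

   --  Assumed definition of the String_3 subtype.
   subtype String_3 is String (1 .. 3);

begin
   pragma Assert (From_Latin_1'Length = 3);
   pragma Assert (From_UTF_8'Length = 4);

   Put_Line ("Latin-1 source:" & Integer'Image (From_Latin_1'Length));
   Put_Line ("UTF-8 source:  " & Integer'Image (From_UTF_8'Length));

   declare
      --  Four characters do not fit String_3: Constraint_Error.
      Checked : constant String_3 := From_UTF_8;
   begin
      Put_Line (Checked);
   end;
exception
   when Constraint_Error =>
      Put_Line ("length check caught the 4-octet literal");
end Length_Demo;
```

The length check only notices the mismatch because the expected
length is stated in the subtype. A literal made of 7-bit characters
alone has the same length under both readings. (With GNAT, the
pragma Assert lines are only checked when assertions are enabled,
e.g. with -gnata.)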

Then, when read as Latin-1, the literal String_3'("§ 1")
in the Subject parameter of

    Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at
compile time, in a static predicate, say.


package UTF_8_Checks is

    pragma Pure (UTF_8_Checks);

    --  (Not working statically, in current Ada.)

    --  If:
    --    - static functions include expression functions of only
    --      static expressions,
    --
    --  then function Is_UTF_8 below can test a string literal
    --  at compile time.

    U0 : constant := 0;
    U1 : constant := 2#1000_0000#;
    U2 : constant := 2#1100_0000#;
    U3 : constant := 2#1110_0000#;
    U4 : constant := 2#1111_0000#;
    U5 : constant := 2#1111_1000#;
    UX : constant := 255;

    subtype XString is String (1 .. 12)
       with Static_Predicate => XString'Last < Positive'Last;
    --  for string_literals of a static string subtype

    type XInteger is range 0 .. 255;

    function Is_UTF_8_Follow (C : Character) return Boolean is
       --  an octet that has its most significant bit set, but
       --  not the next one:
       (Character'Pos (C) in U1 .. U2 - 1);

    function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
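       --  True if Subject (First .. Subject'Last) is a structurally
       --  well-formed sequence of UTF-8 octets: each lead octet is
       --  followed by the right number of follow octets, with at
       --  most 4 octets per sequence, as needed for code points up
       --  to 16#10_FFFF#.  Overlong forms, surrogates and the lead
       --  octets 16#F5# .. 16#F7# are not rejected.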
       (if First > Subject'Last then True
       else
         (case XInteger (Character'Pos (Subject (First))) is
             when 0 .. U1 - 1 =>
               --  "ASCII 7 bit"
               Is_UTF_8 (First + 1, Subject),

             when U1 .. U2 - 1 =>
               --  a follow octet (see Is_UTF_8_Follow) cannot
               --  start a sequence
               False,

             when U2 .. U3 - 1 =>
               (if First > Subject'Last - 1 then False
                else
                  (for all j in 1 .. 1 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 2, Subject)),

             when U3 .. U4 - 1 =>
               (if First > Subject'Last - 2 then False
                else
                  (for all j in 1 .. 2 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 3, Subject)),

             when U4 .. U5 - 1 =>
               (if First > Subject'Last - 3 then False
                else
                  (for all j in 1 .. 3 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and
                  Is_UTF_8 (First + 4, Subject)),

             when U5 .. UX =>
               False));

end UTF_8_Checks;
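
Until something like that can be evaluated statically, the same
structural test can at least be run at run time. What follows is
only a sketch under stated assumptions: it takes an unconstrained
String and uses a loop in place of the recursive expression function,
so it is not the static form proposed above. UTF_8_Demo, Octet,
Is_Follow, Follow_Count and this one-parameter Is_UTF_8 are names made
up for the example. Like the package, it checks only the lead/follow
octet structure and does not reject overlong forms or surrogates.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF_8_Demo is

   subtype Octet is Natural range 0 .. 255;

   --  An octet with its most significant bit set, but not the next.
   function Is_Follow (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Number of follow octets announced by a lead octet,
   --  or -1 if the octet cannot start a sequence.
   function Follow_Count (C : Character) return Integer is
     (case Octet'(Character'Pos (C)) is
         when 16#00# .. 16#7F# => 0,
         when 16#C0# .. 16#DF# => 1,
         when 16#E0# .. 16#EF# => 2,
         when 16#F0# .. 16#F7# => 3,
         when others           => -1);

   --  Hypothetical run-time variant: structural check only.
   function Is_UTF_8 (Subject : String) return Boolean is
      I : Positive := Subject'First;
      N : Integer;
   begin
      while I <= Subject'Last loop
         N := Follow_Count (Subject (I));
         --  Reject a stray follow octet or a truncated sequence.
         if N < 0 or else I > Subject'Last - N then
            return False;
         end if;
         for J in 1 .. N loop
            if not Is_Follow (Subject (I + J)) then
               return False;
            end if;
         end loop;
         I := I + N + 1;
      end loop;
      return True;
   end Is_UTF_8;

   --  "§ 1" as octets: from a Latin-1 file and from a UTF-8 file.
   Latin_1_Literal : constant String :=
     Character'Val (16#A7#) & " 1";
   UTF_8_Literal   : constant String :=
     Character'Val (16#C2#) & Character'Val (16#A7#) & " 1";

begin
   pragma Assert (not Is_UTF_8 (Latin_1_Literal));
   pragma Assert (Is_UTF_8 (UTF_8_Literal));

   Put_Line ("Latin-1 octets: "
             & Boolean'Image (Is_UTF_8 (Latin_1_Literal)));
   Put_Line ("UTF-8 octets:   "
             & Boolean'Image (Is_UTF_8 (UTF_8_Literal)));
end UTF_8_Demo;
```

The Latin-1 form fails because 16#A7# on its own is a follow octet
with no lead octet before it. The UTF-8 form passes because 16#C2#
announces exactly one follow octet and 16#A7# supplies it.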
