comp.lang.ada
From: "G.B." <bauhaus@futureapps.invalid>
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Fri, 21 Oct 2016 14:28:51 +0200
Message-ID: <nud1le$6is$1@dont-email.me>
In-Reply-To: <nu9s5v$18f0$1@gioia.aioe.org>

On 20.10.16 09:36, Dmitry A. Kazakov wrote:
> On 20/10/2016 02:31, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
>> news:nu4nee$18le$1@gioia.aioe.org...
>> ...
>>> Numeric character is a constraint expressible in Ada:
>>>
>>>    subtype Numeric is Character range '0'..'9';
>>>
>>> Numeric string constraint is not expressible, but it is still a constraint.
>>
>> It's expressible as a predicate, though; that's the entire point of
>> predicates (to act like user-defined constraints):
>>
>>     subtype Numeric_String is String
>>         with Dynamic_Predicate =>
>>            (for all E of Numeric_String => E in Numeric);
>>
>> It's not 100% as good as a constraint (as modifications of individual
>> components won't be checked), but it almost always will do the job.
>
> Not nice. Is there a reason why, apart from premature optimization?

I think you can add an aspect to the component type
and have it checked on assignment to a component.
The aspect could differ from the constraint in some
way; merely repeating the constraint appears to loop
infinitely with current GNATs.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78066
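
A minimal sketch of the component-aspect idea, under some assumptions: the names Digit and Digit_String are hypothetical, the array is a new type rather than a subtype of String, and the aspect is a predicate with no range constraint beside it (so it does not reproduce the report above). With assertion checks enabled, I would expect the second assignment to raise Assertion_Error.

```ada
pragma Assertion_Policy (Check);  --  enable predicate checks

with Ada.Assertions;
with Ada.Text_IO;

procedure Component_Check is

   --  Hypothetical names, not taken from the thread
   subtype Digit is Character
     with Dynamic_Predicate => Digit in '0' .. '9';

   --  The predicate belongs to the component subtype, so it
   --  applies to each assignment to a single component
   type Digit_String is array (Positive range <>) of Digit;

   S : Digit_String := "2016";
   C : Character := 'x';

begin
   S (1) := '3';   --  satisfies the predicate
   begin
      S (2) := C;  --  value is checked against subtype Digit
      Ada.Text_IO.Put_Line ("not checked");
   exception
      when Ada.Assertions.Assertion_Error =>
         Ada.Text_IO.Put_Line ("rejected");
   end;
end Component_Check;
```

The price is that Digit_String is no longer a String, so passing it where a String is expected needs a conversion.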


Anyway, a little inconvenience for starters:

     subtype My_Utf_8_String is String
       --  or, when not String, some array of any component type
       --  suitable as a byte sequence item type
       with Dynamic_Predicate => Is_Well_Formed (My_Utf_8_String);

     Bom: constant String := String'(Character'Val (16#EF#),
                                     Character'Val (16#BB#),
                                     Character'Val (16#BF#));

     function Has_Bom (U8: String) return Boolean is
       (U8'Length >= 3
          and then U8 (U8'First .. U8'First + 2) = Bom);

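     --  Needs "with Ada.Unchecked_Conversion;" and "with Interfaces;"
     --  in the context clause. "abs" yields the byte value of a
     --  Character.
     function "abs" is new Ada.Unchecked_Conversion
       (Character, Interfaces.Unsigned_8);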

     function Is_Well_Formed (U8 : String) return Boolean is
     --  `U8` has permissible bit patterns for all bytes. (No Table 3.7
     --  support.)
       ((if U8'Length > 0 then
           (if Has_Bom (U8)
            then
              Is_Well_Formed (U8 (U8'First + 3 .. U8'Last))
            else
              (for all J in U8'Range =>
                  (case abs U8 (J) is
                      when 2#0_0000000# .. 2#0_1111111# =>
                          --  ASCII compatibility
                          True,
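                      when 2#10_000000# .. 2#10_111111# =>
                          --  is a following byte: a leading byte must
                          --  precede it, 1, 2, or 3 positions back, with
                          --  only following bytes in between
                         ((J > U8'First
                             and then abs U8 (J - 1)
                               in 2#110_00000# .. 2#11110_111#)
                          or else
                          (J - 1 > U8'First
                             and then abs U8 (J - 1)
                               in 2#10_000000# .. 2#10_111111#
                             and then abs U8 (J - 2)
                               in 2#1110_0000# .. 2#11110_111#)
                          or else
                          (J - 2 > U8'First
                             and then (for all K in J - 2 .. J - 1 =>
                                         abs U8 (K)
                                         in 2#10_000000# .. 2#10_111111#)
                             and then abs U8 (J - 3)
                               in 2#11110_000# .. 2#11110_111#)
                         ),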
                      when 2#110_00000# .. 2#110_11111# =>
                         (if J < U8'Last then
                            (abs U8 (J + 1)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False),
                      when 2#1110_0000# .. 2#1110_1111# =>
                         (if J + 1 < U8'Last then
                            (for all K in J + 1 .. J + 2 =>
                               abs U8 (K)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False
                         ),
                      when 2#11110_000# .. 2#11110_111# =>
                         (if J + 2 < U8'Last then
                            (for all K in J + 1 .. J + 3 =>
                               abs U8 (K)
                               in 2#10_000000# .. 2#10_111111#)
                          else
                            False
                         ),
                      when 2#11111_000# .. 2#11111_111# =>
                          --  not in Table 3.6 (UTF-8 Bit Distribution)
                          False
                  )
              )
           )
           --  String of length 0:
         else True));

     Test_Bom : constant My_Utf_8_String := Bom & "ABC";
     Test_US : constant My_Utf_8_String := "ABC";
     Test_GR : constant My_Utf_8_String := "ΑΒΓ";
     Test_RU : constant My_Utf_8_String := "АБГ";
     Test_Xx : constant My_Utf_8_String :=
       ('A', Character'Val (16#E4#), 'E');
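
With assertion checks enabled, each of these declarations raises Assertion_Error at elaboration if the predicate fails, which Test_Xx looks bound to do: 16#E4# is the leading byte of a three-byte sequence, and 'E' is not a following byte. A membership test evaluates the same predicate and just returns a Boolean. To stay self-contained, the sketch below uses the simpler Numeric_String from the quoted discussion as a stand-in; the procedure name and the two constants are hypothetical. A test such as `Candidate in My_Utf_8_String` should behave the same way.

```ada
with Ada.Text_IO;

procedure Membership_Check is

   subtype Numeric is Character range '0' .. '9';

   subtype Numeric_String is String
     with Dynamic_Predicate =>
       (for all E of Numeric_String => E in Numeric);

   --  Plain Strings, so no predicate check happens here
   Good : constant String := "2016";
   Bad  : constant String := "20x6";

begin
   --  A membership test evaluates the predicate and yields a
   --  Boolean instead of raising Assertion_Error
   Ada.Text_IO.Put_Line (Boolean'Image (Good in Numeric_String));  --  TRUE
   Ada.Text_IO.Put_Line (Boolean'Image (Bad in Numeric_String));   --  FALSE
end Membership_Check;
```

This makes it possible to list accepted and rejected test strings side by side without aborting at the first rejected one.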

