From: "G.B."
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Fri, 21 Oct 2016 14:28:51 +0200
Organization: A noiseless patient Spider
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
Reply-To: nonlegitur@futureapps.de

On 20.10.16 09:36, Dmitry A. Kazakov wrote:
> On 20/10/2016 02:31, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" wrote in message
>> news:nu4nee$18le$1@gioia.aioe.org...
>> ...
>>> Numeric character is a constraint expressible in Ada:
>>>
>>>    subtype Numeric is Character range '0'..'9';
>>>
>>> Numeric string constraint is not expressible, but it still a constraint.
>>
>> It's expressible as a predicate, though; that's the entire point of
>> predicates (to act like user-defined constraints):
>>
>>     subtype Numeric_String is String
>>        with Dynamic_Predicate => (for all E of Numeric_String => E in
>> Numeric);
>>
>> It's not 100% as good as a constraint (as modifications of individual
>> components won't be checked), but it almost always will do the job.
>
> Not nice. Is there a reason why, apart from premature optimization?

I think you can add an aspect to the component type and have that
checked on assignment to a component. The aspect could somehow be
different from the constraint; also, just repeating it appears to
loop infinitely with current GNATs.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78066

Anyway, a little inconvenience for starters:

    --  Needs: with Ada.Unchecked_Conversion; with Interfaces;

    subtype My_Utf_8_String is String
      --  or, when not String, some array of any component type
      --  suitable as a byte sequence item type
      with Dynamic_Predicate => Is_Well_Formed (My_Utf_8_String);

    Bom : constant String :=
      String'(Character'Val (16#EF#),
              Character'Val (16#BB#),
              Character'Val (16#BF#));

    function Has_Bom (U8 : String) return Boolean is
      (U8'Length >= 3
       and then U8 (U8'First .. U8'First + 2) = Bom);

    function "abs" is new Ada.Unchecked_Conversion
      (Character, Interfaces.Unsigned_8);

    function Is_Well_Formed (U8 : String) return Boolean is
      --  `U8` has permissible bit patterns for all bytes. (No Table 3.7
      --  support.)
      ((if U8'Length > 0 then
          (if Has_Bom (U8) then
             Is_Well_Formed (U8 (U8'First + 3 .. U8'Last))
           else
             (for all J in U8'Range =>
                (case abs U8 (J) is
                   when 2#0_0000000# .. 2#0_1111111# =>
                      --  ASCII compatibility
                      True,
                   when 2#10_000000# .. 2#10_111111# =>
                      --  is a following byte; only the byte directly
                      --  before it is inspected, so the 2nd and 3rd
                      --  following bytes of 3- and 4-byte sequences
                      --  are rejected
                      (if J > U8'First then
                         (abs U8 (J - 1) in 2#110_00000# .. 2#110_11111#
                          or abs U8 (J - 1) in 2#1110_0000# .. 2#1110_1111#
                          or abs U8 (J - 1) in 2#11110_000# .. 2#11110_111#)
                       else False),
                   when 2#110_00000# .. 2#110_11111# =>
                      --  lead byte of a 2-byte sequence
                      (if J < U8'Last then
                         (abs U8 (J + 1) in 2#10_000000# .. 2#10_111111#)
                       else False),
                   when 2#1110_0000# .. 2#1110_1111# =>
                      --  lead byte of a 3-byte sequence
                      (if J + 1 < U8'Last then
                         (for all K in J + 1 .. J + 2 =>
                            abs U8 (K) in 2#10_000000# .. 2#10_111111#)
                       else False),
                   when 2#11110_000# .. 2#11110_111# =>
                      --  lead byte of a 4-byte sequence
                      (if J + 2 < U8'Last then
                         (for all K in J + 1 .. J + 3 =>
                            abs U8 (K) in 2#10_000000# .. 2#10_111111#)
                       else False),
                   when 2#11111_000# .. 2#11111_111# =>
                      --  not in Table 3.6 (UTF-8 Bit Distribution)
                      False
                )
             )
          )
        --  String of length 0:
        else True));

    Test_Bom : constant My_Utf_8_String := Bom & "ABC";
    Test_US  : constant My_Utf_8_String := "ABC";
    Test_GR  : constant My_Utf_8_String := "ΑΒΓ";
    Test_RU  : constant My_Utf_8_String := "АБГ";
    Test_Xx  : constant My_Utf_8_String :=
      ('A', Character'Val (16#E4#), 'E');
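
Randy's quoted remark that modifications of individual components are
not checked can be tried with a small stand-alone sketch. It reuses
Numeric and Numeric_String from the quoted text. The procedure name
Predicate_Demo, the variables S and Bad, and the use of
pragma Assertion_Policy (Check) to switch predicate checks on are
assumptions of this sketch, not part of the thread.

```ada
--  Assumption: predicate checks are enabled through the standard
--  Assertion_Policy pragma rather than through a compiler switch.
pragma Assertion_Policy (Check);

with Ada.Assertions;
with Ada.Text_IO; use Ada.Text_IO;

procedure Predicate_Demo is

   subtype Numeric is Character range '0' .. '9';

   subtype Numeric_String is String
     with Dynamic_Predicate =>
       (for all E of Numeric_String => E in Numeric);

   S   : Numeric_String := "123";   --  hypothetical test object
   Bad : String := "1x3";           --  same length, one non-digit

begin
   --  Component assignment: only the component subtype (Character)
   --  is involved, so the predicate of Numeric_String is not
   --  re-evaluated here.
   S (2) := 'x';
   Put_Line ("after component assignment: " & S);

   --  A membership test evaluates the predicate explicitly.
   Put_Line ("S in Numeric_String: "
             & Boolean'Image (S in Numeric_String));

   --  Whole-object assignment converts Bad to Numeric_String,
   --  which is a point where the predicate is checked.
   begin
      S := Bad;
      Put_Line ("whole assignment was not checked");
   exception
      when Ada.Assertions.Assertion_Error =>
         Put_Line ("whole assignment failed the predicate check");
   end;
end Predicate_Demo;
```

Under these assumptions the component assignment should pass without
a check, while the membership test and the whole-object assignment
both evaluate the predicate. That is the gap between a predicate and
a real constraint discussed above.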