From: Georg Bauhaus
Date: Fri, 22 Nov 2013 12:54:31 +0100
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
References: <73e0853b-454a-467f-9dc7-84ca5b9c29b2@googlegroups.com>
 <1ghx537y5gbfq.17oazom68d4n6.dlg@40tude.net>
 <5bf1b290-70bc-4240-b27c-120ce6b0b840@googlegroups.com>
 <7464679c-6b98-4e23-a337-83b671473553@googlegroups.com>
 <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>
In-Reply-To: <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>
Message-ID: <528f45f8$0$6557$9b4e6d93@newsspool4.arcor-online.net>

On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime).
>> That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression functions
will be valuable. I wonder why these have not yet been defined?

For example, an implementation such as Janus/Ada reads string
literals as Latin-1, so static expression functions could test
properties of the literal. (Length checks being another useful,
though less reliable, option.)

Then, when read as Latin-1, the literal

   String_3'("§ 1")

in the Subject parameter of

   Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at
compile time. In a static predicate, say.

package UTF_8_Checks is

   pragma Pure (UTF_8_Checks);

   -- (Not working statically, in current Ada.)
   -- If:
   -- - static functions include expression functions of only
   --   static expressions,
   --
   -- then function Is_UTF_8 below can test a string literal
   -- at compile time.

   U0 : constant := 0;
   U1 : constant := 2#1000_0000#;
   U2 : constant := 2#1100_0000#;
   U3 : constant := 2#1110_0000#;
   U4 : constant := 2#1111_0000#;
   U5 : constant := 2#1111_1000#;
   UX : constant := 255;

   subtype XString is String (1 .. 12)
     with Static_Predicate => XString'Last < Positive'Last;
   -- for string_literals of a static string subtype

   type XInteger is range 0 .. 255;

   function Is_UTF_8_Follow (C : Character) return Boolean is
      -- an octet that has its most significant bit set, but
      -- not the next one:
     (Character'Pos (C) in U1 .. U2 - 1);

   function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
      -- every sequence of characters from Subject is a valid UTF-8
      -- sequence, assuming code points up to 16#10_FFFF#.
     (if First > Subject'Last then
         True
      else
        (case XInteger (Character'Pos (Subject (First))) is
            when 0 .. U1 - 1 =>
               -- "ASCII 7 bit"
               Is_UTF_8 (First + 1, Subject),
            when U1 .. U2 - 1 =>
               -- handled by Is_UTF_8_Follow
               False,
            when U2 .. U3 - 1 =>
              (if First > Subject'Last - 1 then
                  False
               else
                 (for all j in 1 .. 1 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and Is_UTF_8 (First + 2, Subject)),
            when U3 .. U4 - 1 =>
              (if First > Subject'Last - 2 then
                  False
               else
                 (for all j in 1 .. 2 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and Is_UTF_8 (First + 3, Subject)),
            when U4 .. U5 - 1 =>
              (if First > Subject'Last - 3 then
                  False
               else
                 (for all j in 1 .. 3 =>
                     Is_UTF_8_Follow (Subject (First + j)))
                  and Is_UTF_8 (First + 4, Subject)),
            when U5 .. UX =>
               False));

end UTF_8_Checks;
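
Until expression functions can be static, the same structural test can
at least be run when the program executes. What follows is only a
minimal sketch under my own assumptions: the names UTF_8_Demo, Is_Follow
and Is_Valid are hypothetical, the parameter is an unconstrained String
rather than XString, and the octets are built with Character'Val so that
the result does not depend on how the compiler reads the source file.
Like the package above, it checks only lead octets and the number of
continuation octets, not overlong forms or surrogates.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF_8_Demo is

   --  Same lead-octet boundaries as in package UTF_8_Checks.
   U1 : constant := 2#1000_0000#;
   U2 : constant := 2#1100_0000#;
   U3 : constant := 2#1110_0000#;
   U4 : constant := 2#1111_0000#;
   U5 : constant := 2#1111_1000#;

   --  "§ 1" as UTF-8 octets (C2 A7 20 31) and as Latin-1 (A7 20 31).
   Section_UTF_8   : constant String :=
     Character'Val (16#C2#) & Character'Val (16#A7#) & " 1";
   Section_Latin_1 : constant String :=
     Character'Val (16#A7#) & " 1";

   --  True if C is a continuation octet (bit pattern 10xx_xxxx).
   function Is_Follow (C : Character) return Boolean is
     (Character'Pos (C) in U1 .. U2 - 1);

   --  Hypothetical run-time counterpart of Is_UTF_8: every lead octet
   --  must be followed by the right number of continuation octets.
   function Is_Valid (Subject : String) return Boolean is
      Pos   : Integer := Subject'First;
      Extra : Natural;
   begin
      while Pos <= Subject'Last loop
         case Natural'(Character'Pos (Subject (Pos))) is
            when 0 .. U1 - 1  => Extra := 0;
            when U2 .. U3 - 1 => Extra := 1;
            when U3 .. U4 - 1 => Extra := 2;
            when U4 .. U5 - 1 => Extra := 3;
            when others       => return False;  --  stray or invalid octet
         end case;

         if Extra > Subject'Last - Pos then
            return False;  --  sequence is cut off at the end
         end if;

         for J in 1 .. Extra loop
            if not Is_Follow (Subject (Pos + J)) then
               return False;
            end if;
         end loop;

         Pos := Pos + Extra + 1;
      end loop;
      return True;
   end Is_Valid;

begin
   Put_Line ("UTF-8 octets:  " & Boolean'Image (Is_Valid (Section_UTF_8)));
   Put_Line ("Latin-1 octet: " & Boolean'Image (Is_Valid (Section_Latin_1)));
end UTF_8_Demo;
```

The first line should report TRUE and the second FALSE, since a lone
16#A7# is a continuation octet without a lead octet. This only finds
the problem when the program runs, which is exactly what a static
Is_UTF_8 would improve on.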