From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
Subject: Re: strange behaviour of utf-8 files
Date: Fri, 22 Nov 2013 12:54:31 +0100
Message-ID: <528f45f8$0$6557$9b4e6d93@newsspool4.arcor-online.net>
In-Reply-To: <672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com>
On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression
functions will be valuable. I wonder why these have not
yet been defined.

For example, an implementation such as Janus/Ada reads
string literals as Latin-1, so static expression functions
could test properties of the literal as it was read.
(Length checks are another useful, though less reliable,
option.)

Then, when read as Latin-1, the literal String_3'("§ 1")
in the Subject parameter of

   Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at
compile time, in a static predicate, say.

package UTF_8_Checks is

   pragma Pure (UTF_8_Checks);

   -- (Not working statically, in current Ada.)
   -- If:
   -- - static functions include expression functions of only
   --   static expressions,
   --
   -- then function Is_UTF_8 below can test a string literal
   -- at compile time.

   U0 : constant := 0;
   U1 : constant := 2#1000_0000#;
   U2 : constant := 2#1100_0000#;
   U3 : constant := 2#1110_0000#;
   U4 : constant := 2#1111_0000#;
   U5 : constant := 2#1111_1000#;
   UX : constant := 255;

   subtype XString is String (1 .. 12)
     with Static_Predicate => XString'Last < Positive'Last;
   -- for string_literals of a static string subtype

   type XInteger is range 0 .. 255;

   function Is_UTF_8_Follow (C : Character) return Boolean is
      -- an octet that has its most significant bit set, but
      -- not the next one:
      (Character'Pos (C) in U1 .. U2 - 1);

   function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
      -- every sequence of characters from Subject is a valid UTF-8
      -- sequence, assuming code points up to 16#10_FFFF#.
      (if First > Subject'Last then True
       else
         (case XInteger (Character'Pos (Subject (First))) is
             when 0 .. U1 - 1 =>
                -- "ASCII 7 bit"
                Is_UTF_8 (First + 1, Subject),
             when U1 .. U2 - 1 =>
                -- handled by Is_UTF_8_Follow
                False,
             when U2 .. U3 - 1 =>
                (if First > Subject'Last - 1 then False
                 else
                   (for all j in 1 .. 1 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                   and
                   Is_UTF_8 (First + 2, Subject)),
             when U3 .. U4 - 1 =>
                (if First > Subject'Last - 2 then False
                 else
                   (for all j in 1 .. 2 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                   and
                   Is_UTF_8 (First + 3, Subject)),
             when U4 .. U5 - 1 =>
                (if First > Subject'Last - 3 then False
                 else
                   (for all j in 1 .. 3 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                   and
                   Is_UTF_8 (First + 4, Subject)),
             when U5 .. UX =>
                False));

end UTF_8_Checks;
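
To show what such a check would compute, here is a run-time
sketch only, not the static version wished for above. The
names UTF_8_Demo, Octet, Is_Follow, Latin_1 and UTF_8 are
made up for this illustration, and Is_UTF_8 is rewritten as
a loop over an unconstrained String rather than over XString.
The two constants spell out the octets of "§ 1" as stored in
Latin-1 (16#A7#) and in UTF-8 (16#C2# 16#A7#), so the result
does not depend on how a compiler reads the source file.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF_8_Demo is

   --  Hypothetical helper type, mirroring XInteger above.
   type Octet is range 0 .. 255;

   --  A follow octet has the bit pattern 10xx_xxxx.
   function Is_Follow (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Run-time variant (an assumption, not the static check):
   --  tests lead/follow structure only, like the package above.
   function Is_UTF_8 (S : String) return Boolean is
      I : Integer := S'First;
      N : Natural;  --  number of follow octets the lead octet asks for
   begin
      while I <= S'Last loop
         case Octet (Character'Pos (S (I))) is
            when 16#00# .. 16#7F# => N := 0;
            when 16#C0# .. 16#DF# => N := 1;
            when 16#E0# .. 16#EF# => N := 2;
            when 16#F0# .. 16#F7# => N := 3;
            when others           => return False;
         end case;

         --  Not enough octets left for the announced sequence.
         if N > S'Last - I then
            return False;
         end if;

         for J in 1 .. N loop
            if not Is_Follow (S (I + J)) then
               return False;
            end if;
         end loop;

         I := I + N + 1;
      end loop;
      return True;
   end Is_UTF_8;

   --  "§ 1" as octets: Latin-1 stores the section sign as A7,
   --  UTF-8 stores it as C2 A7.
   Latin_1 : constant String := Character'Val (16#A7#) & " 1";
   UTF_8   : constant String :=
     Character'Val (16#C2#) & Character'Val (16#A7#) & " 1";

begin
   Put_Line (Boolean'Image (Is_UTF_8 (Latin_1)));  --  FALSE
   Put_Line (Boolean'Image (Is_UTF_8 (UTF_8)));    --  TRUE
end UTF_8_Demo;
```

As with the package above, this looks only at the pattern of
lead and follow octets. It does not reject overlong encodings,
surrogates, or code points beyond 16#10_FFFF#.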