From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level:
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00, REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
Path: eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "G.B."
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Tue, 18 Oct 2016 17:10:35 +0200
Organization: A noiseless patient Spider
Message-ID:
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>
Reply-To: nonlegitur@futureapps.de
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Oct 2016 15:10:17 -0000 (UTC)
Injection-Info: mx02.eternal-september.org; posting-host="2cc2841b4cef93c1f6e318df5b26f539"; logging-data="5277"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19q9DLg4FGaeYiex6ofvPcRYlqLN3o6TII="
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
In-Reply-To:
Cancel-Lock: sha1:r8pX5VlknELwXvw7Y9S6Tvq55G0=
Xref: news.eternal-september.org comp.lang.ada:32120
Date: 2016-10-18T17:10:35+02:00
List-Id:

On 18.10.16 14:24, Dmitry A. Kazakov wrote:

>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>> and it satisfies a predicate that involves all components x, x', x'', ...
>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>> then x' is such-and-such, and so on. I'm not sure this predicate
>> is easily stated as a stand-alone type invariant, for example, but
>> that's the idea. It shouldn't have to be visible to Ada programmers.
>
> Sorry, that is a meaningless set of words.

Spelling out the look of model strings for type UTF_8_String
cannot quite be meaningless.
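
For illustration, the predicate described in the quoted paragraph can at
least be written down with the help of a function. What follows is a
minimal sketch, not anything the RM defines: the names
Has_UTF_8_Structure and UTF_8_Text are made up here, and the check only
covers the lead/continuation octet structure. It does not reject every
overlong form, surrogate code points, or code points beyond 16#10FFFF#.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF_8_Subset_Demo is

   --  Hypothetical helper, not an RM entity: True if every lead octet
   --  in S is followed by the number of continuation octets it announces.
   function Has_UTF_8_Structure (S : String) return Boolean;

   --  Hypothetical subtype: the Strings that pass the structural check.
   subtype UTF_8_Text is String
     with Dynamic_Predicate => Has_UTF_8_Structure (UTF_8_Text);

   function Has_UTF_8_Structure (S : String) return Boolean is
      I     : Integer := S'First;
      Octet : Natural;
      Tail  : Natural;  --  continuation octets announced by the lead octet
   begin
      while I <= S'Last loop
         Octet := Character'Pos (S (I));
         if Octet <= 16#7F# then
            Tail := 0;                       --  ASCII, single octet
         elsif Octet in 16#C2# .. 16#DF# then
            Tail := 1;
         elsif Octet in 16#E0# .. 16#EF# then
            Tail := 2;
         elsif Octet in 16#F0# .. 16#F4# then
            Tail := 3;
         else
            return False;                    --  stray continuation or invalid lead
         end if;

         if Tail > S'Last - I then
            return False;                    --  sequence is cut short
         end if;

         for K in I + 1 .. I + Tail loop
            Octet := Character'Pos (S (K));
            if Octet not in 16#80# .. 16#BF# then
               return False;                 --  not of the form 2#10......#
            end if;
         end loop;

         I := I + Tail + 1;
      end loop;
      return True;
   end Has_UTF_8_Structure;

   ASCII_Only : constant String := "abc";
   --  A-Umlaut as a single Latin-1 octet
   Latin_1    : constant String := (1 => Character'Val (16#C4#));
   --  A-Umlaut as the two-octet UTF-8 sequence C3 84
   Two_Octets : constant String :=
     Character'Val (16#C3#) & Character'Val (16#84#);

begin
   --  Membership tests evaluate the predicate without raising anything.
   Put_Line (Boolean'Image (ASCII_Only in UTF_8_Text));  --  TRUE
   Put_Line (Boolean'Image (Latin_1 in UTF_8_Text));     --  FALSE
   Put_Line (Boolean'Image (Two_Octets in UTF_8_Text));  --  TRUE
end UTF_8_Subset_Demo;
```

The sketch leans on a function call inside the aspect, so it says nothing
about whether the same condition reads well as a stand-alone expression.
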
Ada Rationale: "Type invariants are designed for use with private
types where we want some relationship to always hold between
components of the type". (2.4)

We want some relationship to always hold between components
of type UTF_8_String (if it were private, so that 2.4 might
formally apply):

   type UTF_Rep_Text is ...
      with Type_Invariant =>
        (...
         (case UTF_Rep_Text (K) is
             when 2#10_000000# .. 2#10_111111# =>
                (case UTF_Rep_Text (K + 1) is
                    when 2#1_0000000# .. 2#1_1111111# => ...))
         ...);

> Type constraint is put on type values.

(Type values or a type's values?)
AI-05-0146: "invariants apply to all values of a type, while
constraints are generally used to identify a subset of the values
of a type".

UTF_8_String does identify a subset of the values of type String,
by intent, even if it takes more of the RM to see that:

Since Strings are dumb insofar as they allow every value of type
Character as a component, a string that is a well-formed UTF-8
sequence U of octets (each octet appearing as a Character) belongs
to a subset of type String's values. All are of finite length.

These well-formed sequences U establish a subset of all possible
String values. Call it UTF_8_String, not Unicode_String, nor
UCS_String. As said, I don't think that the set's predicate is
easy to state. With an aspect stating it, a purported UTF_8_String
value that isn't one will be dropped from the set, perhaps as loudly
as raising Encoding_Error does now.

> Values of UTF-8 strings are not values of strings, as A-umlaut promptly demonstrates. Period.

Of course they can be (ASCII subset). Also, the UTF_String thing
is just a vague expression of what RM A.4.11(47/3) states more
specifically, for going from this representation-oriented subtype
to "real" characters from the UCS. But this is not otherwise
reflected in the subtype, AFAICS.

Considering

   UTF_String'("'Ä' is A-Umlaut");

the literal, if taken at face value, doesn't say which characters
there are going to be.
It takes an Ada compiler to interpret the source text and decide
whether it is representing Latin-1 or a multi-octet sequence,
possibly one that needs Wide_Character or Wide_Wide_Character.

> "Remainders are values ... in Character" makes no sense either.

   Character'Val (N rem 256);

-- 
"HOTDOGS ARE NOT BOOKMARKS"
Springfield Elementary teaching staff