From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: Bug in Ada - Latin 1 is not a subset of UTF-8
Date: Tue, 18 Oct 2016 18:35:42 +0200
References: <86f0d2fe-d498-4bc4-bb9d-e34629c89bb4@googlegroups.com>

On 2016-10-18 17:10, G.B. wrote:
> On 18.10.16 14:24, Dmitry A. Kazakov wrote:
>
>>> still, any UTF-8 encoded "string" of UCS objects is wellformed
>>> and it satisfies a predicate that involves all components x, x', x'',
>>> ...
>>> of a UTF_8_String object, by stating that if x matches 2#10......#,
>>> then x' is such-and-such, and so on. I'm not sure this predicate
>>> is easily stated as a stand-alone type invariant, for example, but
>>> that's the idea. It shouldn't have to be visible to Ada programmers.
>>
>> Sorry, that is a meaningless set of words.
>
> Spelling out the look of model strings for type UTF_8_String
> cannot quite be meaningless.
>
> Ada Rationale:
>
> "Type invariants are designed for use with private types
> where we want some relationship to always hold between
> components of the type". (2.4)

That is completely irrelevant.
No invariant can make Latin-1 A-umlaut UTF-8 A-umlaut.

>> Type constraint is put on type values.
>
> (Type values or a type's values?)

Values of a type, e.g. Positive is a constrained Integer.

> UTF_8_String does identify a subset of the values of
> type String, by intent,

No, it does not; that is why this implementation is broken. UTF-8
strings can be represented by String; they can also be represented by
Boolean arrays, by indefinite integers, or by polygons. That does not
make them a Boolean array subtype. No way.

>> Values of UTF-8 strings are not values of strings, as A-umlaut
>> promptly demonstrates. Period.
>
> Of course they can be (ASCII subset).

A-umlaut is not ASCII.

> But this is not otherwise reflected in the subtype, AFAICS.
> Considering
>
>    UTF_String'("'Ä' is A-Umlaut");
>
> the literal, if taken at face value, doesn't say which characters
> there are going to be.

It does exactly this, once you define "character".

> It takes an Ada compiler to interpret the
> source text and decide whether it is representing Latin-1 or
> a multi-octet sequence, possibly one that needs Wide_Character
> or Wide_Wide_Character.

There is nothing to interpret considering literals of Universal_String.
It is no different from the way Universal_Integer is handled. String,
UTF-8 string, and Wide string can all be considered subtypes of
Universal_String; that has no effect on the relationship between
String and UTF-8 string.

The same holds if literals are considered overloaded functions. No
different.

>> "Remainders are values ... in Character" makes no sense either.
>
>    Character'Val (N rem 256);

So what? Numeric characters are still a constrained subtype of the
character type.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de