From: Georg Bauhaus
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
Date: Mon, 18 Nov 2013 11:06:45 +0100
Message-ID: <5289e6b6$0$6623$9b4e6d93@newsspool2.arcor-online.net>

On 18.11.13 10:01, Dmitry A. Kazakov wrote:

>> UTF-8 can actually be so checked (and is checked by typical implementations)
>
> 1. The share of illegal UTF-8 sequences is negligible. The one among Ada
> programs is even less than that.

The share of illegal UTF-8 sequences in source text stays low as long
as policies prevent use of anything but ASCII. But! OTOH, the
difficulty of adapting to use of limited character sets stays high,
unnerving, and costly. (I know because the source text used here and
elsewhere is full of ASCII-sequences representing ubiquitous Unicode
characters. These are characters that users expect to see. If 1234
codes some common international character, then having to write
"abc \x{1234}" all over the place is a PITA. The need to write
"abc ["1234"]" GNAT style does not change that.)

> 2. Latin1 sequences are all legal.

Legality of (only) almost all octets interpreted as Latin-1 characters
does not make the interpretation of string literals correct.
Correctness involves the problem specification, not just Ada. Which is
what matters most: The *user*, the raison d'être of programming, is not
really satisfied when legal programs will actually malfunction because
of legal ambiguity of legal octets. Would anyone be at ease with
similar ambiguity of number literals?

> Now, carefully observe that the program in question was dealt with as if it
> were encoded in Latin1. So much for your theory.

My theory involves programmers, foreign software, and users, in
addition to the mere formalism that you mention.

> ---------------
> P.S. In order to make a point you should take a set of legal [and
> practical] Ada programs encoded in X and then reinterpreted in Y. Then you
> compare how many of them become:

0. useful

> 1. illegal
> 2. remain legal keeping the semantics
> 3. remain legal breaking the semantics

Note that legality can always be established together with 0, and
automatically is, once programmers can easily specify character
encoding to be something unambiguous.
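The X-read-as-Y experiment from the quoted P.S. can be tried on string
literals alone. What follows is only a rough sketch, written in Python
rather than Ada for brevity. The function name `reinterpret` and the
category labels are made up for this illustration, and the check looks
at decoded text only, not at Ada's legality rules.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Classify what happens to `text` when saved as `written_as` but read as `read_as`."""
    octets = text.encode(written_as)
    try:
        seen = octets.decode(read_as)
    except UnicodeDecodeError:
        # The reader can detect the mistake: category 1 of the quoted list.
        return "illegal"
    if seen == text:
        return "legal, semantics kept"    # category 2
    return "legal, semantics broken"      # category 3: silently different text

# \u00ea is e-circumflex, \u00fc is u-umlaut, \u00df is sharp s.
samples = ["abc", "raison d'\u00eatre", "Gr\u00fc\u00dfe"]

for s in samples:
    print(ascii(s),
          "| UTF-8 read as Latin-1:", reinterpret(s, "utf-8", "latin-1"),
          "| Latin-1 read as UTF-8:", reinterpret(s, "latin-1", "utf-8"))
```

Python's "latin-1" codec accepts every octet, so UTF-8 text read as
Latin-1 never lands in "illegal" here: non-ASCII text stays legal with
broken semantics. Read the other way round, the UTF-8 check rejects
these samples.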
The stubbornness of 7bit engineering in OSs and in other circumstances
calls for a

   pragma Source_Text_Encoding (...);

With this warning sign in place, both old and new generations of
programmers can do their job.
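As an analogy only, not a statement about Ada: Python source text can
carry a declaration of this kind, the PEP 263 coding comment. The
sketch below compiles the same octets once with and once without such a
declaration. The helper name `literal_from` is made up for this
illustration, and the behaviour shown is that of CPython 3, which
assumes UTF-8 when nothing is declared.

```python
def literal_from(source: bytes) -> str:
    """Compile `source` (raw octets) and return the string it binds to `s`."""
    namespace = {}
    exec(compile(source, "<octets>", "exec"), namespace)
    return namespace["s"]

# Octet 0xEA is e-circumflex in Latin-1 but not a valid sequence here in UTF-8.
body = b's = "raison d\'\xeatre"\n'

# With the declaration, the octets have exactly one reading.
declared = b"# -*- coding: latin-1 -*-\n" + body
print(ascii(literal_from(declared)))

# Without it, the UTF-8 default applies and the literal is rejected.
try:
    literal_from(body)
except (SyntaxError, UnicodeDecodeError) as err:
    print("rejected:", type(err).__name__)
```

Either way the outcome is unambiguous: the text is read as declared, or
it is refused, but it is never silently read as something else.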