From: Georg Bauhaus
Date: Mon, 18 Nov 2013 09:38:06 +0100
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
Message-ID: <5289d1e7$0$6643$9b4e6d93@newsspool2.arcor-online.net>

On 17.11.13 21:38, Dmitry A. Kazakov wrote:
> The problem is that the common part (ASCII) is sufficient for Ada
> programming while the varying part is subtle enough to cause difficult to
> detect bugs in string literals. Bugs that cannot be detected by the
> compiler.

UTF-8 can actually be checked so strictly (and typical implementations do check it) that accidentally mistaking some octets of a string literal for Latin-1 coded characters is impossible. This is a consequence of the design of UTF-8, as you know: the {1}+0 prefix rules, by which the lead octet of a multi-octet sequence starts with two or more 1 bits followed by a 0, and every continuation octet starts with 10.

Actually, a compiler (GNAT already has a helpful spell checker) could detect occurrences in string literals of String'(N => Character'Val (195), N+1 => Character'Val (179)) as very likely being the valid UTF-8 sequence representing "ó". It could then emit a warning saying that the source text might be UTF-8 rather than Latin-1, and suggest a compiler switch accordingly. Of course, the presence of a BOM would add further support to this warning.
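
To make the suggested check concrete, here is a minimal sketch in Python (not Ada, and not anything GNAT actually does). It takes a literal that was read as Latin-1 and asks whether every octet above 127 belongs to a well-formed multi-octet UTF-8 sequence. The names `utf8_multibyte_spans` and `looks_like_utf8` are made up for illustration, and the rule "all high octets must be covered" is just one plausible heuristic for when to warn.

```python
def utf8_multibyte_spans(octets: bytes) -> list[tuple[int, int]]:
    """Return (start, length) of each well-formed multi-octet UTF-8 sequence."""
    spans = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        # The lead octet's prefix (110, 1110, 11110) fixes the sequence length.
        if 0xC2 <= lead <= 0xDF:
            n = 2
        elif 0xE0 <= lead <= 0xEF:
            n = 3
        elif 0xF0 <= lead <= 0xF4:
            n = 4
        else:
            n = 0  # ASCII, a stray continuation octet, or an invalid lead
        chunk = octets[i:i + n]
        if n and len(chunk) == n:
            try:
                # The strict decoder also rejects overlong forms and surrogates.
                chunk.decode("utf-8")
                spans.append((i, n))
                i += n
                continue
            except UnicodeDecodeError:
                pass
        i += 1
    return spans


def looks_like_utf8(literal_as_latin1: str) -> bool:
    """Heuristic (an assumption of this sketch): warn only if the literal has
    octets above 127 and all of them sit inside well-formed UTF-8 sequences."""
    octets = literal_as_latin1.encode("latin-1")
    high = sum(1 for b in octets if b >= 0x80)
    covered = sum(length for _, length in utf8_multibyte_spans(octets))
    return high > 0 and covered == high
```

With the two octets from the example, `looks_like_utf8(chr(195) + chr(179))` is true, while a genuine Latin-1 literal such as `"caf" + chr(233)` is not flagged, because octet 233 announces a three-octet sequence that never follows.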