From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!newsfeed.datemas.de!uucp.gnuu.de!newsfeed.arcor.de!newsspool2.arcor-online.net!news.arcor.de.POSTED!not-for-mail
Date: Mon, 18 Nov 2013 09:38:06 +0100
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:24.0) Gecko/20100101 Thunderbird/24.0.1
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
References: <73e0853b-454a-467f-9dc7-84ca5b9c29b2@googlegroups.com>
 <1ghx537y5gbfq.17oazom68d4n6.dlg@40tude.net>
 <9d00683c-949c-4e88-a161-ebd78b350d39@googlegroups.com>
 <1w23uq33ul2i8$.wzjpp3evot36.dlg@40tude.net>
 <5288c584$0$6639$9b4e6d93@newsspool2.arcor-online.net>
 <k4o32tkp8lu4.1fuul1c08z90n$.dlg@40tude.net>
 <52891372$0$6636$9b4e6d93@newsspool2.arcor-online.net>
 <10ec0vuld83fy.1t7bduzwsrfe.dlg@40tude.net>
In-Reply-To: <10ec0vuld83fy.1t7bduzwsrfe.dlg@40tude.net>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Message-ID: <5289d1e7$0$6643$9b4e6d93@newsspool2.arcor-online.net>
Organization: Arcor
NNTP-Posting-Date: 18 Nov 2013 09:37:59 CET
NNTP-Posting-Host: 111a9a45.newsspool2.arcor-online.net
X-Trace: 
 DXC=c5gH08P;I4g<6cDJZfMd_cA9EHlD;3Ycb4Fo<]lROoRa8kF<OcfhCOkXG44Z`h@GQoPCY\c7>ejVh[k3<:EhI9Z`1VN=@H2gSnl
X-Complaints-To: usenet-abuse@arcor.de
Xref: news.eternal-september.org comp.lang.ada:17713
Date: 2013-11-18T09:37:59+01:00
List-Id: <comp.lang.ada>

On 17.11.13 21:38, Dmitry A. Kazakov wrote:
> The problem is that the common part (ASCII) is sufficient for Ada
> programming while the varying part is subtle enough to cause difficult to
> detect bugs in string literals. Bugs that cannot be detected by the
> compiler.

UTF-8 can actually be checked (and typical implementations do check it),
which makes it practically impossible to accidentally confuse the octets of
a string literal with Latin-1 coded characters. This is a consequence of the
design of UTF-8, as you know: the {1}+0 prefix rules, where a lead octet
starts with a run of 1 bits closed by a 0 bit, and every continuation octet
starts with 10.
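
To make the prefix rules concrete, here is a minimal sketch of such a check.
It is not taken from any compiler; the names Check_UTF8 and Is_Well_Formed
are made up for illustration, and the check is deliberately simplified: it
looks only at the prefixes and does not reject overlong forms or surrogates.
The test strings are built with Character'Val so that the source itself stays
plain ASCII.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Check_UTF8 is

   --  "f", Latin-1 o-acute (one octet, 243), "o"
   Latin_1 : constant String := "f" & Character'Val (243) & "o";

   --  "f", UTF-8 o-acute (two octets, 195 179), "o"
   UTF_8 : constant String :=
     "f" & Character'Val (195) & Character'Val (179) & "o";

   --  Simplified check using only the prefix rules (an assumption of
   --  this sketch: overlong forms and surrogates are not rejected).
   function Is_Well_Formed (S : String) return Boolean is
      I      : Natural := S'First;
      Code   : Natural;
      Follow : Natural;
   begin
      while I <= S'Last loop
         Code := Character'Pos (S (I));
         if Code < 16#80# then
            Follow := 0;               --  0xxxxxxx: ASCII
         elsif Code in 16#C0# .. 16#DF# then
            Follow := 1;               --  110xxxxx: one more octet
         elsif Code in 16#E0# .. 16#EF# then
            Follow := 2;               --  1110xxxx: two more octets
         elsif Code in 16#F0# .. 16#F7# then
            Follow := 3;               --  11110xxx: three more octets
         else
            return False;              --  stray 10xxxxxx, or 11111xxx
         end if;
         if Follow > S'Last - I then
            return False;              --  sequence is cut short
         end if;
         for K in 1 .. Follow loop
            --  every continuation octet must be 10xxxxxx
            if Character'Pos (S (I + K)) not in 16#80# .. 16#BF# then
               return False;
            end if;
         end loop;
         I := I + Follow + 1;
      end loop;
      return True;
   end Is_Well_Formed;

begin
   Put_Line (Boolean'Image (Is_Well_Formed (Latin_1)));  --  FALSE
   Put_Line (Boolean'Image (Is_Well_Formed (UTF_8)));    --  TRUE
end Check_UTF8;
```

The Latin-1 octet 243 has the bit pattern 11110011, so it announces three
continuation octets that never arrive, and the check fails.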

Actually, a compiler (GNAT already has a helpful spell checker) could
detect occurrences in string literals of

    String'(N   => Character'Val (195),
            N+1 => Character'Val (179))

as very likely being the valid UTF-8 sequence representing "ó". It could
then emit a warning saying that the source text might be UTF-8 rather than
Latin-1, and suggest a compiler switch accordingly. Of course, the presence
of a BOM would add further support to this warning.
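
A rough sketch of the heuristic, for illustration only: the names Guess_UTF8
and Suspect_Pair and the wording of the warning are hypothetical, this is not
something GNAT does today as far as I know, and only two-octet sequences are
considered. The literal is again built from Character'Val so the source stays
ASCII.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Guess_UTF8 is

   --  A literal read as Latin-1 that contains the octets 195 179.
   Literal : constant String :=
     "Sim" & Character'Val (195) & Character'Val (179) & "n";

   --  Hypothetical heuristic: index of the first octet pair that is a
   --  valid two-octet UTF-8 sequence (lead 16#C2# .. 16#DF#, then a
   --  continuation 16#80# .. 16#BF#), or 0 if there is none.
   function Suspect_Pair (S : String) return Natural is
   begin
      for I in S'First .. S'Last - 1 loop
         if Character'Pos (S (I)) in 16#C2# .. 16#DF#
           and then Character'Pos (S (I + 1)) in 16#80# .. 16#BF#
         then
            return I;
         end if;
      end loop;
      return 0;
   end Suspect_Pair;

begin
   declare
      Pos : constant Natural := Suspect_Pair (Literal);
   begin
      if Pos > 0 then
         Put_Line ("warning: octets at" & Natural'Image (Pos)
                   & " look like UTF-8; source may not be Latin-1");
      end if;
   end;
end Guess_UTF8;
```

A real implementation would presumably weigh the evidence, since a pair like
this can also be legitimate Latin-1 text, which is why it should only ever be
a warning.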