comp.lang.ada
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
Subject: Re: strange behaviour of utf-8 files
Date: Mon, 18 Nov 2013 11:06:45 +0100
Message-ID: <5289e6b6$0$6623$9b4e6d93@newsspool2.arcor-online.net>
In-Reply-To: <l7bus5vigc0g$.1t5p3ok0bbpo4$.dlg@40tude.net>

On 18.11.13 10:01, Dmitry A. Kazakov wrote:

>> UTF-8 can actually be so checked (and is checked by typical implementations)
>
> 1. The share of illegal UTF-8 sequences is negligible. The one among Ada
> programs is even less than that.

The share of illegal UTF-8 sequences in source text stays low as
long as policies prevent the use of anything but ASCII. OTOH, the
difficulty of working within such a limited character set stays
high: it is unnerving and costly.

(I know because the source text used here and elsewhere is full of
ASCII sequences representing ubiquitous Unicode characters. These are
characters that users expect to see. If 1234 codes some common
international character, then having to write

   "abc \x{1234}"
all over the place is a PITA. The need to write

   "abc ["1234"]"
GNAT style does not change that.)

> 2. Latin1 sequences are all legal.

Legality of (only) almost all octets interpreted as Latin-1 characters
does not make the interpretation of string literals correct.
Correctness involves the problem specification, not just Ada.

And that is what matters most: the *user*, the raison d'être of
programming, is not well served when legal programs malfunction
because legal octets are legally ambiguous. Would anyone be at ease
with a similar ambiguity in number literals?
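
A minimal sketch of that ambiguity, written in Python only because the
effect does not depend on Ada: the same two octets are legal under both
readings, yet they denote different text. The octet values and the helper
name `interpretations` are chosen purely for illustration.

```python
def interpretations(octets: bytes) -> tuple[str, str]:
    """Return the UTF-8 reading and the Latin-1 reading of the same octets."""
    return octets.decode("utf-8"), octets.decode("latin-1")


# The two octets that UTF-8 uses to encode U+00E9 (e with acute accent).
octets = bytes([0xC3, 0xA9])

as_utf8, as_latin1 = interpretations(octets)

# Both decodings succeed, so neither reading is "illegal",
# but only one of them can be what the programmer meant.
print(len(as_utf8), [hex(ord(c)) for c in as_utf8])      # one character
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])  # two characters
```

Neither decoding raises an error; the difference only shows up later,
in what the user gets to see.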

> Now, carefully observe that the program in question was dealt with as if it
> were encoded in Latin1. So much for your theory.

My theory involves programmers, foreign software, and users,
in addition to the mere formalism that you mention.

> ---------------
> P.S. In order to make a point you should take a set of legal [and
> practical] Ada programs encoded in X and then reinterpreted in Y. Then you
> compare how many of them become:

0. useful

> 1. illegal
> 2. remain legal keeping the semantics
> 3. remain legal breaking the semantics

Note that legality can always be established together with 0,
and automatically is, once programmers can easily specify the
character encoding as something unambiguous.

The stubbornness of 7bit engineering in OSs and in other circumstances
calls for a

   pragma Source_Text_Encoding (...);

With this warning sign in place, both old and new generations of programmers
can do their job.
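
As a rough illustration of what a declared encoding buys (again in Python,
and not a model of how any Ada compiler processes source text): once the
encoding is stated, a tool can check the octets against it. A strict UTF-8
check rejects non-ASCII text that was saved as Latin-1, whereas a Latin-1
reading accepts UTF-8 octets without complaint. The helper name
`is_valid_as` and the sample text are assumptions made for this sketch.

```python
def is_valid_as(octets: bytes, declared_encoding: str) -> bool:
    """Report whether the octets decode cleanly under the declared encoding."""
    try:
        octets.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


text = "caf\u00e9"                      # sample text with one non-ASCII character
latin1_octets = text.encode("latin-1")  # what an editor set to Latin-1 would save
utf8_octets = text.encode("utf-8")      # what an editor set to UTF-8 would save

# Declared UTF-8, actually Latin-1: the mismatch is detected.
print(is_valid_as(latin1_octets, "utf-8"))

# Declared Latin-1, actually UTF-8: accepted, the text is silently misread.
print(is_valid_as(utf8_octets, "latin-1"))
```

The check only has something to work with if the declaration exists in the
first place, which is the job of the pragma.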



