comp.lang.ada
From: Adam Beneschan <adam@irvine.com>
Subject: Re: Interpretation of extensions different from Unix/Linux?
Date: Wed, 19 Aug 2009 09:10:09 -0700 (PDT)
Message-ID: <b3995dcd-11bd-4101-94b7-0d154c0ad47d@l35g2000pra.googlegroups.com>
In-Reply-To: h6f44m$d6n$1@munin.nbi.dk

On Aug 18, 1:48 pm, "Randy Brukardt" <ra...@rrsoftware.com> wrote:
> "Adam Beneschan" <a...@irvine.com> wrote in message
>
> news:6f80c882-fa03-4ca9-a53e-fae34cea160d@b15g2000yqd.googlegroups.com...
> On Aug 17, 3:28 pm, "Randy Brukardt" <ra...@rrsoftware.com> wrote:
>
> >> The problem here is that String really is not the right type, but since
> >> you
> >> can't have string literals for private types in Ada, you can't make it a
> >> private type. (And if you could have string literals, it still couldn't
> >> be
> >> used with the existing I/O packages, it would be way too incompatible.)
>
> >That wouldn't even be an issue if UTF-8 were strictly a "storage
> >format" as you called it above.  If that were the case, you wouldn't
> >need string literals for it.  I think the problem is that UTF-8 is
> >something of a hybrid.  If all characters in the string are in the
> >32..126 range, the "sequence of octets" stored in the UTF-8 string is
> >identical to the graphic characters stored in a String.  (UTF-8 was
> >designed purposefully so that would happen.)  In cases like that, it
> >makes sense to use a string literal.
>
> Well, the problem here is that it *always* makes sense to use a string
> literal. That's how you specify what you want in storage in Ada.

We might be talking about two different things.  If I had a type whose
purpose was to represent the storage of a string in Huffman code, it
wouldn't make sense to specify its value with a string literal,
because the octets in that storage wouldn't have any direct
relationship with any meaningful graphic characters.  So specifying
the value of such a type with a string literal wouldn't make sense,
except via a conversion routine of some sort.  UTF-8 is different
because in many cases the octets being stored *do* have a direct
relationship with the graphic characters.  That's what I was trying to
get at.  But it's possible that I've lost the plot.
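
To make the "direct relationship" concrete, here is a small sketch in Python (chosen only for brevity; nothing here is Ada or part of any Ada package). The file names are made-up examples. It shows that a string confined to the 32..126 range yields the same octets under Latin-1 and UTF-8, and that the octets diverge as soon as one character falls outside that range.

```python
# A name using only characters in the 32..126 range (hypothetical example).
ascii_name = "report.txt"
same_for_ascii = ascii_name.encode("latin-1") == ascii_name.encode("utf-8")

# A name containing U+00FC (u with umlaut), which is outside that range.
umlaut_name = "\u00fcber.txt"
latin1_octets = umlaut_name.encode("latin-1")  # one octet for U+00FC
utf8_octets = umlaut_name.encode("utf-8")      # two octets for U+00FC

print(same_for_ascii)
print(latin1_octets)
print(utf8_octets)
```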


> >Also, I'm afraid that using String can backfire.  If I understand it
> >correctly, the decision was that the Name parameter of Text_IO.Open
> >should be interpreted as a UTF-8 octet sequence even though it's a
> >String.  But the intent is to allow string literals.  At some point,
> >though, some poor innocent programmer in Germany or Spain is going to
> >try to use a string literal (or a Latin-1 string variable) with an
> >umlaut or an accented vowel in it and get totally screwed up since
> >those characters don't represent themselves in UTF-8 encoding, and
> >they'll end up puzzling over how their program created a file with a
> >Chinese character in the middle of the name.  (Yeah, I know, that's
> >very unlikely; most likely the UTF-8 encoding will simply be invalid.)
>
> I've been presuming that UTF-8 encoding started with a BOM or something like
> that, else you couldn't tell it from regular Latin-1 encoding. It would be
> hard to insert a BOM into a string literal by accident!
>
> But I do agree that this issue needs some discussion.

My understanding, which could be wrong, was that the UTF-8 standard
specified how to encode a single "character" (from the entire set of
up to 2**31 possibilities) as a sequence of 8-bit octets.  I wasn't
aware that the standard also said anything about special octet
sequences that precede the encoding of a character *sequence*.  Maybe
it does; I'll have to go back and look.
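
As a sketch of what "encode a single character as a sequence of octets" means, the following Python function encodes one code point by hand. The function name encode_code_point is made up for illustration. It assumes the range limit of RFC 3629 (code points up to 16#10FFFF#, surrogates excluded), which is narrower than the original design. Note that nothing in the per-character encoding involves a header or marker before the sequence.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 octets (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        # 7 bits: the octet is the code point itself (the ASCII range).
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in encoder for a few code points.
for cp in (0x41, 0xFC, 0x4E2D, 0x1F600):
    print(hex(cp), encode_code_point(cp) == chr(cp).encode("utf-8"))
```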

I don't have a lot of experience working with UTF-8 encoded files (on
Windows, say) to know how they're stored or how applications
distinguish the encodings.  In mail messages, though, I sometimes see
contents and attachments that are encoded in UTF-8, and some that are
encoded in Latin-1.  (My mail reader is pretty primitive so I can get
a good look at exactly how the message is encoded.)  In the UTF-8
messages, I don't see any special characters or anything at the
beginning of the contents; in that case, I think it's information in
the mail message headers that tells the application whether to
interpret the string as encoded in UTF-8 or Latin-1.
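
The point that the octets alone don't say which encoding applies can be sketched in Python as follows. The sample word is an arbitrary example. The same octets give different text depending on which charset the reader is told to use, and Latin-1 octets for an umlaut are not even well-formed UTF-8.

```python
word = "f\u00fcr"  # arbitrary sample containing U+00FC

# UTF-8 octets read under each of the two declared charsets.
utf8_octets = word.encode("utf-8")
read_as_utf8 = utf8_octets.decode("utf-8")      # the original word
read_as_latin1 = utf8_octets.decode("latin-1")  # two characters where one was meant

# Latin-1 octets handed to a UTF-8 decoder: 16#FC# cannot start a sequence.
latin1_octets = word.encode("latin-1")
try:
    latin1_octets.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(read_as_utf8 == word, len(read_as_latin1), latin1_is_valid_utf8)
```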

... OK, after a bit of research, it looks as though the use of BOM to
distinguish UTF-8 from other encodings isn't part of an official
"standard", but it's widely used by Windows applications.  But I also
note that in the Wikipedia entry for "byte-order mark", they recommend
*against* this practice on Unix-type systems (for example, if a shell
script were stored in this manner, its first two characters would no
longer be recognized as "#!" by utilities that check for shell scripts).  I
don't know whether this advice is obsolete or not.  Anyway, perhaps
this means that the UTF_Encoding package shouldn't assume the use of
BOM to distinguish UTF-8-encoded strings, but perhaps the routines
there should provide an optional parameter to specify what
"protocol" (if any) is used to distinguish between encodings (a
protocol of None would mean that we assume the calling program knows
by some other means what kind of encoding is being used, and the
UTF_Encoding package just converts single characters according to the
standard without worrying about any kind of header sequence).
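
A rough sketch of the optional-parameter idea, in Python rather than Ada. The names Protocol and decode_utf8 are invented for illustration and are not taken from any UTF_Encoding proposal. With a protocol of NONE the caller is assumed to know the encoding already, so no header is looked for; with BOM the leading marker is required and stripped.

```python
import codecs
from enum import Enum


class Protocol(Enum):
    """How the encoding is signalled (hypothetical names)."""
    NONE = "none"  # the caller knows the encoding by some other means
    BOM = "bom"    # octets must start with the UTF-8 byte-order mark


def decode_utf8(octets: bytes, protocol: Protocol = Protocol.NONE) -> str:
    """Decode UTF-8 octets, handling the header per the chosen protocol."""
    if protocol is Protocol.BOM:
        if not octets.startswith(codecs.BOM_UTF8):
            raise ValueError("expected a UTF-8 BOM at the start")
        octets = octets[len(codecs.BOM_UTF8):]
    # With Protocol.NONE nothing is stripped: a leading BOM, if present,
    # comes through as the ordinary character U+FEFF.
    return octets.decode("utf-8")


with_bom = codecs.BOM_UTF8 + b"#!/bin/sh"
print(decode_utf8(with_bom, Protocol.BOM))
print(repr(decode_utf8(with_bom)))
```

The second call shows the shell-script concern: without stripping, the text no longer begins with "#!".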

                                        -- Adam





