comp.lang.ada
From: Stoik <staszek.goldstein@gmail.com>
Subject: Re: strange behaviour of utf-8 files
Date: Sun, 17 Nov 2013 03:12:21 -0800 (PST)
Message-ID: <7464679c-6b98-4e23-a337-83b671473553@googlegroups.com>
In-Reply-To: <z2fwn0g0hlr3$.1bktkfuljfy6b.dlg@40tude.net>

On Saturday, 16 November 2013 16:57:56 UTC+1, Dmitry A. Kazakov wrote:
> On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:
>
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> String = Latin1
> Wide_String = UCS-2
>
> There is no built-in type for UTF-8, though customary one uses String for
> it (and Wide_String for UTF-16).
>
> -- 
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

Thanks for your comments. It is obviously a matter of the editor and the compiler using different encodings: I forgot to add the -gnatW8 switch to the compiler (this should be the default, I believe). Nevertheless, there are still some misunderstandings connected with string, wide_string and wide_wide_string. They do not correspond to any encodings; they correspond to the character repertoires of the encodings you mentioned: string to the first 256 characters of Unicode (or ISO-10646), wide_string to the BMP, and wide_wide_string to the whole of Unicode. In particular, a wide_string can be encoded internally using any of UTF-8, UTF-16 or UTF-32, and the programmer does not need to know which.
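To illustrate the difference between a repertoire and an encoding, here is a minimal sketch in Python (not Ada). The function names are made up for this example, and the pairing of each repertoire with an Ada type simply follows the description above. It classifies a text by the smallest repertoire that contains all of its characters, and then shows that the same text has different byte lengths under different encodings.

```python
def smallest_repertoire(text: str) -> str:
    """Name the smallest of the three repertoires containing every character.

    Hypothetical helper for illustration: the three ranges mirror what
    string, wide_string and wide_wide_string are said to cover above.
    """
    highest = max((ord(ch) for ch in text), default=0)
    if highest <= 0xFF:
        return "Latin-1"       # first 256 code points (string)
    if highest <= 0xFFFF:
        return "BMP"           # Basic Multilingual Plane (wide_string)
    return "full Unicode"      # anything beyond the BMP (wide_wide_string)


def encoded_sizes(text: str) -> dict:
    """Byte length of the same abstract text under three encodings."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


# The Polish word "zażółć", written with escapes so the sketch does not
# depend on how this file itself is encoded.
sample = "za\u017c\u00f3\u0142\u0107"

print(smallest_repertoire(sample))   # which repertoire the text needs
print(len(sample))                   # number of characters
print(encoded_sizes(sample))         # bytes per encoding
```

The character count stays the same whichever encoding is chosen; only the byte representation changes, which is the sense in which the repertoire is independent of the encoding.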

I do not believe one should avoid using characters from outside ASCII in the source code. I tried it in Python and Java with no problems whatsoever. Having to write obscure constants instead of the usual glyphs for characters outside ASCII when using subprograms from ada.(wide_)strings.maps, for example to_mapping, would be gruesome.
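The readability point can be sketched in Python, one of the languages mentioned above. This is only an analogy: str.maketrans stands in for a character mapping and is not the Ada to_mapping subprogram. The example assumes the source file is saved as UTF-8, and the variable names are made up.

```python
# Mapping written with literal glyphs: easy to read and to check by eye.
readable = str.maketrans("ąćęłńóśźż", "acelnoszz")

# The same mapping written with numeric escapes only: correct, but opaque.
escaped = str.maketrans(
    "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c",
    "acelnoszz",
)

# Both spellings describe exactly the same mapping.
assert readable == escaped

# Strip the Polish diacritics from a sample sentence.
print("zażółć gęślą jaźń".translate(readable))
```

Both tables are identical, so the only difference is how easily a reader can verify the source text.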

In any case, GNAT is prepared to deal with the problem properly, although the number of steps the user must remember is a bit too high: setting the environment variable charset to utf-8, choosing utf-8 in the source editor, adding -gnatW8 to the compiler switches, and adding -W8 to the pretty printer switches. And UTF-8 is the only encoding that solves the problem of non-Latin1 characters at all.

Regards

