From: Stoik
Newsgroups: comp.lang.ada
Subject: Re: strange behaviour of utf-8 files
Date: Sun, 17 Nov 2013 03:12:21 -0800 (PST)
Message-ID: <7464679c-6b98-4e23-a337-83b671473553@googlegroups.com>
References: <73e0853b-454a-467f-9dc7-84ca5b9c29b2@googlegroups.com> <1ghx537y5gbfq.17oazom68d4n6.dlg@40tude.net> <5bf1b290-70bc-4240-b27c-120ce6b0b840@googlegroups.com>

On Saturday, 16 November 2013 16:57:56 UTC+1, Dmitry A. Kazakov wrote:
> On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:
>
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> String = Latin1
> Wide_String = UCS-2
>
> There is no built-in type for UTF-8, though customarily one uses String for
> it (and Wide_String for UTF-16).
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

Thanks for your comments. The problem was evidently a mismatch between the encoding used by the editor and the one assumed by the compiler: I forgot to add the -gnatW8 switch to the compiler (this should be the default, I believe). Nevertheless, there is still some misunderstanding about string, wide_string and wide_wide_string. They do not correspond to any encodings; they correspond only to the character repertoires of the encodings you mentioned: string to the first 256 characters of Unicode (ISO 10646), wide_string to the BMP, and wide_wide_string to the whole of Unicode. In particular, a wide_string can be encoded internally using any of UTF-8, UTF-16 or UTF-32; the programmer does not need to know anything about it.

I do not believe one should avoid using characters from outside ASCII in the source code. I tried it in Python and Java with no problems whatsoever. Having to write obscure numeric constants instead of the usual glyphs for non-ASCII characters when calling subprograms from ada.(wide_)strings.maps, for example to_mapping, would be gruesome.

In any case, GNAT is prepared to deal with the problem properly, although the number of steps the user must remember is a bit too high: setting the environment variable charset to utf-8, choosing utf-8 in the source editor, adding -gnatW8 to the compiler switches and -W8 to the pretty-printer switches. And UTF-8 is the only encoding that solves the problem of non-Latin1 characters at all.

Regards
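
The difference between a character repertoire and an encoding, as argued above, can be sketched in a few lines of Python. This is only an illustration and not part of the original post: the function names smallest_ada_string_type and encoded_sizes are hypothetical, and the mapping of code point ranges to string, wide_string and wide_wide_string simply follows the description in the post (first 256 code points, BMP, all of Unicode).

```python
# Illustrative sketch only: repertoire (which characters a type can hold)
# versus encoding (how those characters are laid out as bytes).

def smallest_ada_string_type(text: str) -> str:
    """Name the narrowest Ada string type whose repertoire covers `text`.

    Assumption taken from the post: String covers code points 0..16#FF#,
    Wide_String covers the BMP (0..16#FFFF#), Wide_Wide_String covers all
    of Unicode.
    """
    highest = max((ord(ch) for ch in text), default=0)
    if highest <= 0xFF:
        return "String"
    if highest <= 0xFFFF:
        return "Wide_String"
    return "Wide_Wide_String"


def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in three encodings of the same repertoire."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    # U+00E9 (Latin-1 range), U+017C (BMP), U+1D11E (outside the BMP)
    for sample in ("caf\u00e9", "\u017c", "\U0001d11e"):
        print(ascii(sample), smallest_ada_string_type(sample),
              encoded_sizes(sample))
```

The character U+017C, for instance, always falls in the BMP repertoire, yet it occupies two bytes in UTF-8 and UTF-16 and four in UTF-32. The repertoire class says nothing about which of these layouts is used.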