Re: unicode and wide_text_io

comp.lang.ada

From: Robert Eachus <rieachus@comcast.net>
Subject: Re: unicode and wide_text_io
Date: Wed, 27 Dec 2017 21:20:51 -0800 (PST)
Message-ID: <9e0a433c-2c52-4118-8624-dd7c23496074@googlegroups.com>
In-Reply-To: <p21c28$880$1@franka.jacob-sparre.dk>

On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
> "Mehdi Saada" <00120260a@gmail.com> wrote in message 
> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
> >> Wide_Text_IO is UCS-2. Keep on using UTF-8. You probably
> >> meant output of code points. That is a different beast. Convert a code
> >> point to UTF-8 string and output that. E.g.
> > Sure I'll look at your work, but ... Fundamentally, how can a UTF8 string
> > even represent
> > codepoints beyond the 255th ??
> 
> Easy: it uses a variable-width representation.
> 
> > I may have a rather very shallow understanding of characters encoding and 
> > representation,
> 
> That's the problem. Unless you can stick to Latin-1, you'll need to fix that
> understanding before continuing.
> 
> In Ada,  type Character = Latin-1 = first 256 code positions, 8-bit
> representation. Text_IO and type String are for Latin-1 strings.
>
> type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code
> positions = UCS-2 = 16-bit representation.

There is also UTF-16, which is identical to Unicode; characters in the range D800 to DFFF are used as escapes to allow more than 65536 code points.
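
A minimal sketch of how those escapes (surrogate pairs) are computed for a code point above FFFF. The names Supplementary, Pair and To_Surrogates are made up for illustration and are not part of any standard library; only Interfaces and Ada.Text_IO are real.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure Surrogate_Demo is
   --  Hypothetical names, for illustration only.
   subtype Supplementary is Unsigned_32 range 16#1_0000# .. 16#10_FFFF#;

   type Pair is record
      High, Low : Unsigned_16;
   end record;

   function To_Surrogates (Code_Point : Supplementary) return Pair is
      --  Subtract 16#1_0000#, leaving a 20-bit offset.
      Offset : constant Unsigned_32 := Code_Point - 16#1_0000#;
   begin
      --  Top 10 bits go into D800 .. DBFF, bottom 10 bits into DC00 .. DFFF.
      return (High => Unsigned_16 (16#D800# + Shift_Right (Offset, 10)),
              Low  => Unsigned_16 (16#DC00# + (Offset and 16#3FF#)));
   end To_Surrogates;

   P : constant Pair := To_Surrogates (16#1_F600#);
begin
   --  U+1F600 becomes the pair D83D DE00.
   pragma Assert (P.High = 16#D83D# and P.Low = 16#DE00#);
   Put_Line (Unsigned_16'Image (P.High) & Unsigned_16'Image (P.Low));
end Surrogate_Demo;
```

A program that treats such a file as plain UCS-2 sees two separate 16-bit values where UTF-16 means one character.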
> 
> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

No, all of UCS-4, everything defined in ISO-10646.
> 
> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a
> conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it breaks
> strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and
> Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 
> (there is no good way to tell between them in the general case).
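
For what it's worth, here is a rough sketch of that conversion route, assuming an Ada 2012 compiler: the standard packages live under Ada.Strings.UTF_Encoding. The procedure name and the choice of U+20AC are only for illustration, and whether the last line displays as a euro sign depends on the terminal expecting UTF-8.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure UTF8_Demo is
   --  U+20AC EURO SIGN lies well beyond Latin-1.
   Euro : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#20AC#));

   --  UTF_8_String is just a subtype of String, which is the
   --  strong-typing complaint: nothing marks the bytes as UTF-8.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Euro);
begin
   --  One code point, three bytes (E2 82 AC): the variable-width part.
   pragma Assert (Encoded'Length = 3);
   Ada.Text_IO.Put_Line (Natural'Image (Encoded'Length));
   Ada.Text_IO.Put_Line (Encoded);
end UTF8_Demo;
```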
> 
> Windows uses a BOM character at the start of UTF-8 files to differentiate 
> (at least in programs like Notepad and the built-in edit control), but that 
> is not recommended by Unicode. I think they would prefer a world where 
> Latin-1 had disappeared completely, but that of course is not the real 
> world.
> 
> That's probably enough character set info to get you into trouble. ;-)

Mild trouble anyway, no burnings, no heresy trials. The ISO-10646 standard does favor using the correct BOM at the start of UTF-8, UCS-2 and UCS-4 files.  Unicode is an extended version of UCS-2 that includes pages other than the 10646 BMP (Basic Multilingual Plane).  Using a BOM with Unicode may mislead a program reading the file.  The problem is not telling Unicode from UCS-2 when they are different. There are no differences between Unicode and UCS-2 unless those extra pages are used, so files in most languages will be identical.  Even Japanese and Chinese may not be detectable--unless you omit the BOM for Unicode files. ;-)
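
As a rough illustration of BOM sniffing, assuming Ada 2012: Ada.Strings.UTF_Encoding has an Encoding function that looks for a leading BOM and otherwise just returns whatever default the caller supplies. The procedure and constant names below are made up for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure BOM_Demo is
   --  Hypothetical sample data: the same text with and without a BOM.
   With_BOM    : constant UTF_String := BOM_8 & "abc";
   Without_BOM : constant UTF_String := "abc";
begin
   --  A BOM is present, so it wins over the supplied default.
   pragma Assert (Encoding (With_BOM, Default => UTF_16LE) = UTF_8);

   --  No BOM: the function cannot know, and hands back the default.
   pragma Assert (Encoding (Without_BOM, Default => UTF_16LE) = UTF_16LE);

   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (With_BOM, Default => UTF_16LE)));
   Ada.Text_IO.Put_Line
     (Encoding_Scheme'Image (Encoding (Without_BOM, Default => UTF_16LE)));
end BOM_Demo;
```

Without a BOM the answer is only ever the caller's guess.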

> > Really ?? You're saying that a position such as Wide_Character'Val(X)
> > doesn't correspond to the Xth character in the UNICODE standard ??

Whoo boy, digging a deep hole here. You have to keep in mind that there are at least three character sets that matter when you are programming in Ada (or any other language.)

First, there is the character set that you use to create the program.  The Ada standard provides a default, and it is the one that the compiler tests use. But it is only a default, and GNAT accepts source in different formats. Back when Ada was new, there were compilers for programs written in IBM's EBCDIC.

The second character set you care about (or set of them) is the Ada Character type, along with the other character types.  In the IBM compiler above, Character corresponded to ASCII as expected.  The ordering of character literals was ASCII, not EBCDIC, etc.

The third group of character sets are those that correspond to printers, displays and keyboards.  If you need to write code that supports, say, Cyrillic terminals, you may end up with strings that are really in, say, Russian.  Best to gather them all in one "Language" package, to make it easier when you have to do Ukrainian. :-(

If all three character sets are the same, that's nice.  But it can lead to sloppy thinking.  Way back when the ARG was wrestling with this, getting everyone on the same page about which set of character sets we were discussing allowed us to get things into reasonable shape going into the Ada 9X development.  You want your compiler to allow Shift-JIS in comments?  Sure.  Just remember that an end of line, and only an end of line, terminates a comment.
