comp.lang.ada
From: Keith Thompson <kst-u@mib.org>
Subject: Re: unicode and wide_text_io
Date: Sun, 31 Dec 2017 13:41:19 -0800
Message-ID: <lnh8s6bmo0.fsf@kst-u.example.com>
In-Reply-To: 9e0a433c-2c52-4118-8624-dd7c23496074@googlegroups.com

Robert Eachus <rieachus@comcast.net> writes:
> On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
>> "Mehdi Saada" <00120260a@gmail.com> wrote in message 
>> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
>> >> meant output of code points. That is a different beast. Convert a code
>> >> point to UTF-8 string and output that. E.g.
>> > Sure I'll look to your work, but ... Fundamentally, how can a UTF8 string 
>> > even represent
>> > codepoints next to the 255th ??
>> 
>> Easy: it uses a variable-width representation.
>> 
>> > I may have a rather very shallow understanding of characters encoding and 
>> > representation,
>> 
>> That's the problem. Unless you can stick to Latin-1, you'll need to fix that 
>> understanding before continuing.
>> 
>> In Ada,  type Character = Latin-1 = first 255 code positions, 8-bit 
>> representation. Text_IO and type String are for Latin-1 strings.
>> 
>> type Wide_Character = BMP (Basic Multilingual Plane) = first 65535 code 
>> positions = UCS-2 = 16-bit representation.
>
> There is also UTF16 which is identical to Unicode, characters in the
> range 0D800 to 0DFFF are used as escapes to allow more than 65536
> code-points.

Unicode specifies code points, numeric values for each of a large number
of characters.  UTF-8, UTF-16, and UTF-32/UCS-4 are *representations* of
Unicode.  They're all able to represent all Unicode characters, and they
differ in how they do so.  (ASCII, Latin-1, and UCS-2 are
representations of small subsets of Unicode.)
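
To make the difference concrete, here is a small sketch in Python (chosen
only because its built-in codecs make the bytes easy to inspect; the
helper name `encodings` is made up for this illustration).  It encodes
the same code points in UTF-8, UTF-16, and UTF-32, big-endian and
without a BOM:

```python
# Encode single characters in three Unicode representations.
# Uses only Python's built-in codecs; `encodings` is a hypothetical helper.

def encodings(ch: str) -> dict:
    """Return the code point and its UTF-8, UTF-16 and UTF-32 bytes."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8"),       # 1 to 4 bytes per code point
        "utf-16": ch.encode("utf-16-be"),  # 2 or 4 bytes per code point
        "utf-32": ch.encode("utf-32-be"),  # always 4 bytes per code point
    }

if __name__ == "__main__":
    # One ASCII, one Latin-1, one BMP, and one non-BMP character.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        e = encodings(ch)
        print(f"U+{e['code_point']:04X}",
              e["utf-8"].hex(), e["utf-16"].hex(), e["utf-32"].hex())
```

The code point is the same in every column; only the byte sequence
changes with the representation.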

>> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.
>
> No, all of UCS-4, everything defined in ISO-10646.

What are you saying "No" to?

>> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a 
>> conversion package (Ada.Strings.Encoding) [which is nasty because it breaks 
>> strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and 
>> Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 
>> (there is no good way to tell between them in the general case).
>> 
>> Windows uses a BOM character at the start of UTF-8 files to differentiate 
>> (at least in programs like Notepad and the built-in edit control), but that 
>> is not recommended by Unicode. I think they would prefer a world where 
>> Latin-1 had disappeared completely, but that of course is not the real 
>> world.
>> 
>> That's probably enough character set info to get you into trouble. ;-)
>
> Mild trouble anyway, no burnings, no heresy trials. The ISO-10646
> standard does favor using the correct BOM at the start of UTF-8, UCS-2
> and UCS-4.  Unicode is an extended version of UCS-2 to include pages
> other than the 10646 BMP (Basic multilingual plane).  Using a BOM with
> Unicode may mislead a program reading the file.  The problem is not
> telling Unicode from UCS-2 when they are different. There are no
> differences between Unicode and UCS-2 unless those extra pages are
> used.  Files in most languages will be identical.  Even Japanese and
> Chinese may not be detectable--unless you omit the BOM for Unicode
> files. ;-)

The above is correct if you replace "Unicode" with "UTF-16".  UCS-2
uses 2 bytes per character, with no mechanism for representing code
points above 65535.  UTF-16 is based on UCS-2, with a mechanism
(surrogate pairs) that uses two 2-byte code units to represent each
code point above 65535.
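
As a sketch of how that mechanism works (again in Python, and with a
made-up function name, `to_utf16_units`), a code point above 65535 has
0x10000 subtracted from it, and the remaining 20 bits are split across
two 16-bit units taken from the reserved range D800..DFFF:

```python
# Convert one Unicode code point to its UTF-16 code units.
# `to_utf16_units` is a hypothetical name used only for this example.

def to_utf16_units(code_point: int) -> list:
    """Return a list of one or two 16-bit code units for code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # BMP: a single unit, identical to UCS-2.
        return [code_point]
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]

if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}", [f"{u:04X}" for u in to_utf16_units(cp)])
```

For anything in the BMP the result is exactly what UCS-2 would store,
which is why the two are indistinguishable until a character outside
the BMP appears.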

(In Windows, it's common to refer to Windows-1252 as "ANSI"
and UTF-16 as "Unicode".  Both are incorrect.  Windows-1252 was
submitted to ANSI for standardization, but was never approved.
UTF-16 is a representation of Unicode.)

I don't know what ISO-10646 recommends, but using a BOM with UTF-8
files causes problems on Unix-like systems.  On such systems,
most text files these days are UTF-8 and most do not have a BOM
(because it's not needed; BOM is a byte order mark, and UTF-8 has
no variations in byte ordering).
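
For a program that has to accept both styles of file, one tolerant
approach is to drop a leading UTF-8 BOM if it is there and otherwise
leave the data alone.  A minimal Python sketch (the function name
`strip_utf8_bom` is invented here; `codecs.BOM_UTF8` is the standard
library's constant for the three bytes EF BB BF):

```python
import codecs

# Remove a leading UTF-8 BOM, if any, from raw file contents.
# `strip_utf8_bom` is a hypothetical helper for this illustration.

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF sequence."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

if __name__ == "__main__":
    with_bom = "hello".encode("utf-8-sig")   # this codec prepends the BOM
    without_bom = "hello".encode("utf-8")
    print(with_bom.hex(), without_bom.hex())
    print(strip_utf8_bom(with_bom) == without_bom)
```

Note that a plain UTF-8 decoder does not discard the BOM; it shows up
as the character U+FEFF at the start of the text, which is the usual
source of trouble with tools that do not expect it.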

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
