From: Keith Thompson
Newsgroups: comp.lang.ada
Subject: Re: unicode and wide_text_io
Date: Sun, 31 Dec 2017 13:41:19 -0800
Organization: None to speak of
References: <9e0a433c-2c52-4118-8624-dd7c23496074@googlegroups.com>

Robert Eachus writes:
> On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
>> "Mehdi Saada" <00120260a@gmail.com> wrote in message
>> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> >> Wide_Text_IO is UCS-2. Keep on using UTF-8. You probably meant
>> >> output of code points. That is a different beast. Convert a code
>> >> point to a UTF-8 string and output that. E.g.
>> > Sure, I'll look at your work, but... Fundamentally, how can a UTF-8
>> > string even represent code points beyond the 255th?
>>
>> Easy: it uses a variable-width representation.
>>
>> > I may have a rather shallow understanding of character encoding and
>> > representation,
>>
>> That's the problem. Unless you can stick to Latin-1, you'll need to
>> fix that understanding before continuing.
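To see that variable-width representation in action, here's a quick
sketch (in Python rather than Ada, purely to make the bytes visible) of
how a code point becomes one through four UTF-8 bytes:

```python
# UTF-8 is variable-width: code points above 127 need two or more bytes.
# 'A', e-acute, the Euro sign, and an emoji take 1, 2, 3, and 4 bytes.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:06X} -> {len(encoded)} bytes: {encoded.hex(' ')}")
```

In Ada terms, this byte-level conversion is what the
Ada.Strings.UTF_Encoding.Wide_Wide_Strings package's Encode function
performs when turning a Wide_Wide_String into a UTF-8 string.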
>> In Ada, type Character = Latin-1 = first 256 code positions, 8-bit
>> representation. Text_IO and type String are for Latin-1 strings.
>>
>> type Wide_Character = BMP (Basic Multilingual Plane) = first 65536
>> code positions = UCS-2 = 16-bit representation.
>
> There is also UTF-16, which is identical to Unicode; characters in the
> range D800 to DFFF are used as escapes to allow more than 65536
> code points.

Unicode specifies code points, numeric values for each of a large
number of characters.  UTF-8, UTF-16, and UTF-32/UCS-4 are
*representations* of Unicode.  They're all able to represent all
Unicode characters, and they differ in how they do so.  (ASCII,
Latin-1, and UCS-2 are representations of small subsets of Unicode.)

>> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit
>> representation.
>
> No, all of UCS-4, everything defined in ISO-10646.

What are you saying "No" to?

>> There is no native support in Ada for UTF-8 or UTF-16 strings. There
>> is a conversion package (Ada.Strings.UTF_Encoding) [which is nasty
>> because it breaks strong typing] which lets one use UTF-8 and UTF-16
>> strings with Text_IO and Wide_Text_IO. But you have to know whether
>> you are reading UTF-8 or Latin-1 (there is no good way to tell them
>> apart in the general case).
>>
>> Windows uses a BOM character at the start of UTF-8 files to
>> differentiate (at least in programs like Notepad and the built-in
>> edit control), but that is not recommended by Unicode. I think they
>> would prefer a world where Latin-1 had disappeared completely, but
>> that of course is not the real world.
>>
>> That's probably enough character set info to get you into trouble. ;-)
>
> Mild trouble anyway, no burnings, no heresy trials. The ISO-10646
> standard does favor using the correct BOM at the start of UTF-8,
> UCS-2, and UCS-4 files. Unicode is an extended version of UCS-2 to
> include pages other than the 10646 BMP (Basic Multilingual Plane).
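The D800-DFFF escape mechanism mentioned above is the UTF-16 surrogate
pair scheme, and it's just a bit of arithmetic. A small sketch (Python
again; the helper name is mine, not from any library) of how a code
point above U+FFFF splits into a high and low surrogate:

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000                 # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low

hi, lo = surrogate_pair(0x1F600)     # an emoji, outside the BMP
print(f"{hi:04X} {lo:04X}")          # D83D DE00
# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM):
print(chr(0x1F600).encode("utf-16-le").hex(" "))  # 3d d8 00 de
```

Because D800-DFFF are reserved for this purpose, no valid character ever
occupies that range, which is what makes the escapes unambiguous.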
> Using a BOM with Unicode may mislead a program reading the file. The
> problem is not telling Unicode from UCS-2 when they are different.
> There are no differences between Unicode and UCS-2 unless those extra
> pages are used. Files in most languages will be identical. Even
> Japanese and Chinese may not be detectable--unless you omit the BOM
> for Unicode files. ;-)

The above is correct if you replace "Unicode" by "UTF-16".  UCS-2 uses
2 bytes per character, with no mechanism for representing code points
above 65535.  UTF-16 is based on UCS-2, with a mechanism for using
pairs of 2-byte sequences to represent code points above 65535.

(In Windows, it's common to refer to Windows-1252 as "ANSI" and UTF-16
as "Unicode".  Both are incorrect.  Windows-1252 was submitted to ANSI
for standardization, but was never approved.  UTF-16 is a
representation of Unicode.)

I don't know what ISO-10646 recommends, but using a BOM with UTF-8
files causes problems on Unix-like systems.  On such systems, most text
files these days are UTF-8, and most do not have a BOM (because it's
not needed; a BOM is a byte order mark, and UTF-8 has no variations in
byte ordering).

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"