comp.lang.ada
* unicode and wide_text_io
@ 2017-12-27 18:08 Mehdi Saada
  2017-12-27 20:04 ` Dmitry A. Kazakov
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Mehdi Saada @ 2017-12-27 18:08 UTC (permalink / raw)


I would like to avoid rewriting an I/O-related package, which would prove tiresome from beginning to end. As it is, it uses UTF-8 (so TEXT_IO), but for ONE, only ONE character, I need the "put" of WIDE_TEXT_IO. The slash character ⁄ would allow better looking fractions for outputting rationals, since it's meant to tell the terminal to consider the numbers before and after it as superscript and subscript, respectively.
Is there a way in Unicode, in UTF-8, to shift outside of UTF-8?
I doubt it, and saying it like this sounds self-contradictory, but that would be fun, so ... ?

Or else, do I write a "put" for the screen only?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
@ 2017-12-27 20:04 ` Dmitry A. Kazakov
  2017-12-27 21:47   ` Dennis Lee Bieber
  2017-12-27 22:32 ` Mehdi Saada
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Dmitry A. Kazakov @ 2017-12-27 20:04 UTC (permalink / raw)


On 2017-12-27 19:08, Mehdi Saada wrote:

> I would like to avoid rewriting an I-O related package, which would
> prove tiresome to the end. As it is, it uses UTF8 (so TEXT_IO),
Ada.Text_IO is Latin-1, at least formally. Use Stream I/O instead if you 
don't want surprises.
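
For example, something like this (an untested sketch; the procedure name is mine) pushes the octets straight to the stream behind the standard output, so Text_IO cannot recode them:

    with Ada.Text_IO.Text_Streams;

    procedure Raw_Put (Item : String) is
       --  Write the octets of Item verbatim to the stream behind
       --  Current_Output, bypassing any character-set translation.
    begin
       String'Write
         (Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Current_Output),
          Item);
    end Raw_Put;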

> but for ONE, only ONE character, I need to "put" WIDE_TEXT_IO.

No, you don't. Wide Text_IO is UCS-2. Keep on using UTF-8. You probably 
meant output of code points. That is a different beast. Convert a code 
point to a UTF-8 string and output that. E.g.:

    function Image (Value : UTF8_Code_Point) return String;

here

    http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8

For example:

    Image (16#F8D0#) & Image (16#F8D3#) & Image (16#F8D0#)

would be "ADA" in Klingon. They seem don't know that the proper spelling 
is "Ada", but what would you expect from them? (:-))

> The slash character ⁄ would allow better looking fractions for
> outputting rationals, since it's meant to tell the terminal to
> consider the numbers before and after it as superscript and subscript,
> respectively.

Why don't you simply output super- or subscript digits in UTF-8?

   http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.3

Use Image (Number) from the package instance. That is all.

> Is there a way in unicode in UTF8 to shift outside of UTF8 ?

I don't understand the meaning of this question.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 20:04 ` Dmitry A. Kazakov
@ 2017-12-27 21:47   ` Dennis Lee Bieber
  0 siblings, 0 replies; 26+ messages in thread
From: Dennis Lee Bieber @ 2017-12-27 21:47 UTC (permalink / raw)


On Wed, 27 Dec 2017 21:04:26 +0100, "Dmitry A. Kazakov"
<mailbox@dmitry-kazakov.de> declaimed the following:

>On 2017-12-27 19:08, Mehdi Saada wrote:
>

>> The slash character ⁄ would allow better looking fractions for
>> outputting rationals, since it's meant to tell the terminal to
>> consider the numbers before and after it as superscript and subscript,
>> respectively.
>
>Why don't you simply output super- or subscript digits in UTF-8?
>
	Given the OP's phrasing, it almost sounds like they are trying to send
a terminal-specific control sequence, in which the terminal somehow performs
super-/subscripting on the "numbers" surrounding that sequence.

	Not a feature I recall ever seeing on a terminal... Not on VT100/ANSI
controls, at least.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
  2017-12-27 20:04 ` Dmitry A. Kazakov
@ 2017-12-27 22:32 ` Mehdi Saada
  2017-12-27 22:33   ` Mehdi Saada
                     ` (2 more replies)
  2017-12-28 13:15 ` Mehdi Saada
  2017-12-28 22:36 ` Mehdi Saada
  3 siblings, 3 replies; 26+ messages in thread
From: Mehdi Saada @ 2017-12-27 22:32 UTC (permalink / raw)


> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably 
> meant output of code points. That is a different beast. Convert a code 
> point to UTF-8 string and output that. E.g.
Sure, I'll look at your work, but... Fundamentally, how can a UTF-8 string even represent code points beyond the 255th??
Superscripts and subscripts mean more changes in the I/O package.
Before, I could simply use the generic Integer_IO, but I have no clue how to output a specific code point for each digit in a specific base... wouldn't that mean rewriting part of Integer_IO?

I may have a rather shallow understanding of character encoding and representation, and that's quite an understatement, but you said: "Ada's Character has Latin-1 encoding which differs from UTF-8 in the code positions greater than 127".
Really?? You're saying that a position such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard??
And I know peanuts about the UCS-2 thing. I'm too ignorant to get one bit of what you're saying, except that it sounds like heresy to the ears of the Ada Church. Burn them all!!
Ada.Streams permits output of bytes without any formatting, right? If so, it might do.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 22:32 ` Mehdi Saada
@ 2017-12-27 22:33   ` Mehdi Saada
  2017-12-27 22:48     ` Mehdi Saada
  2017-12-27 23:57   ` Randy Brukardt
  2017-12-28  9:04   ` Dmitry A. Kazakov
  2 siblings, 1 reply; 26+ messages in thread
From: Mehdi Saada @ 2017-12-27 22:33 UTC (permalink / raw)


Le mercredi 27 décembre 2017 23:32:52 UTC+1, Mehdi Saada a écrit :
> > Wide Text_IO is UCS-2. Keep on using UTF-8. You probably 
> > meant output of code points. That is a different beast. Convert a code 
> > point to UTF-8 string and output that. E.g.
> Sure, I'll look at your work, but... Fundamentally, how can a UTF-8 string even represent code points beyond the 255th??
> Superscripts and subscripts mean more changes in the I/O package.
> Before, I could simply use the generic Integer_IO, but I have no clue how to output a specific code point for each digit in a specific base... wouldn't that mean rewriting part of Integer_IO?
> 
> I may have a rather shallow understanding of character encoding and representation, and that's quite an understatement, but you said: "Ada's Character has Latin-1 encoding which differs from UTF-8 in the code positions greater than 127".
> Really?? You're saying that a position such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard??
> And I know peanuts about the UCS-2 thing. I'm too ignorant to get one bit of what you're saying, except that it sounds like heresy to the ears of the Ada Church. Burn them all!!
> Ada.Streams permits output of bytes without any formatting, right? I haven't studied streams yet; sounds too early. But I'll look at it.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 22:33   ` Mehdi Saada
@ 2017-12-27 22:48     ` Mehdi Saada
  2017-12-27 23:32       ` Mehdi Saada
  0 siblings, 1 reply; 26+ messages in thread
From: Mehdi Saada @ 2017-12-27 22:48 UTC (permalink / raw)


I'll put it another way: you're speaking Chinese here ^_^. I've looked at streams in the RM; I understand nothing. Way too early.
Plus, wouldn't it be idiotic of me to rely on someone else's package, if the objective was to understand the ins and outs of my own work?

> Is there a way in unicode in UTF8 to shift outside of UTF8 ? 
It means: output characters in the Unicode standard past the 255th code point, while staying with the standard String type.
How the heck can I easily output a "slash" character?
If I go with the subscripts/superscripts, I'll have to rewrite the whole I/O package, which is a lot of work, and a boring one.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 22:48     ` Mehdi Saada
@ 2017-12-27 23:32       ` Mehdi Saada
  0 siblings, 0 replies; 26+ messages in thread
From: Mehdi Saada @ 2017-12-27 23:32 UTC (permalink / raw)


I finally used ADA.WIDE_TEXT_IO for just the PUT procedure:
   procedure Put (
         Fichier : in     WIDE_TEXT_IO.FILE_TYPE;
         Item    : in     T_Rationnel        ) is

   begin -- Put
      P_Entier_wide.Put(
         File  => Fichier,
         Item  => Numer (Item),
         Width => 1);
      if Denom (Item) /= 1 then
         WIDE_TEXT_IO.Put(
            File => Fichier,
            Item => WIDE_CHARACTER'Val(16#2044#)); -- U+2044, the fraction slash
         P_Entier_wide.Put(
            File  => Fichier,
            Item  => Denom (Item),
            Width => 1);
      end if;
   end Put;

   procedure Put (
         Item : in     T_Rationnel ) is
   begin -- Put
      Put(Fichier => WIDE_TEXT_IO.Standard_Output,
          Item    => Item);
   end Put;
Why would that be wrong, Dmitry ?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 22:32 ` Mehdi Saada
  2017-12-27 22:33   ` Mehdi Saada
@ 2017-12-27 23:57   ` Randy Brukardt
  2017-12-28  5:20     ` Robert Eachus
  2017-12-28  9:04   ` Dmitry A. Kazakov
  2 siblings, 1 reply; 26+ messages in thread
From: Randy Brukardt @ 2017-12-27 23:57 UTC (permalink / raw)


"Mehdi Saada" <00120260a@gmail.com> wrote in message 
news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
>> meant output of code points. That is a different beast. Convert a code
>> point to UTF-8 string and output that. E.g.
> Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string 
> even represent
> codepoints next to the 255th ??

Easy: it uses a variable-width representation.

> I may have a rather very shallow understanding of characters encoding and 
> representation,

That's the problem. Unless you can stick to Latin-1, you'll need to fix that 
understanding before continuing.

In Ada, type Character = Latin-1 = first 256 code positions, 8-bit 
representation. Text_IO and type String are for Latin-1 strings.

type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code 
positions = UCS-2 = 16-bit representation.

type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

There is no native support in Ada for UTF-8 or UTF-16 strings. There is a 
conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it breaks 
strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and 
Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 
(there is no good way to tell between them in the general case).
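
To make the conversion concrete, here is a minimal, untested sketch (the 
procedure name is mine) that decodes a UTF-8 line read through Text_IO:

    with Ada.Text_IO;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    procedure Count_Code_Points is
       --  Get_Line hands back raw octets.  If the input really is UTF-8,
       --  Decode turns them into code points; if it is Latin-1 instead,
       --  Decode may raise Encoding_Error -- the "you have to know" problem.
       Octets : constant String := Ada.Text_IO.Get_Line;
       Text   : constant Wide_Wide_String :=
                  Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Octets);
    begin
       Ada.Text_IO.Put_Line
         (Integer'Image (Octets'Length) & " octets,"
          & Integer'Image (Text'Length) & " code points");
    end Count_Code_Points;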

Windows uses a BOM character at the start of UTF-8 files to differentiate 
(at least in programs like Notepad and the built-in edit control), but that 
is not recommended by Unicode. I think they would prefer a world where 
Latin-1 had disappeared completely, but that of course is not the real 
world.

That's probably enough character set info to get you into trouble. ;-)

                              Randy.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 23:57   ` Randy Brukardt
@ 2017-12-28  5:20     ` Robert Eachus
  2017-12-31 21:41       ` Keith Thompson
  0 siblings, 1 reply; 26+ messages in thread
From: Robert Eachus @ 2017-12-28  5:20 UTC (permalink / raw)


On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
> "Mehdi Saada" <00120260a@gmail.com> wrote in message 
> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
> >> meant output of code points. That is a different beast. Convert a code
> >> point to UTF-8 string and output that. E.g.
> > Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string 
> > even represent
> > codepoints next to the 255th ??
> 
> Easy: it uses a variable-width representation.
> 
> > I may have a rather very shallow understanding of characters encoding and 
> > representation,
> 
> That's the problem. Unless you can stick to Latin-1, you'll need to fix that 
> understanding before contining.
> 
> In Ada,  type Character = Latin-1 = first 255 code positions, 8-bit 
> representation. Text_IO and type String are for Latin-1 strings.
> 
> type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code 
> positions = UCS-2 = 16-bit representation.

There is also UTF-16, which is identical to Unicode; characters in the range D800 to DFFF are used as escapes to allow more than 65536 code points. 
> 
> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.

No, all of UCS-4, everything defined in ISO-10646.
> 
> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a 
> conversion package (Ada.Strings.Encoding) [which is nasty because it breaks 
> strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and 
> Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 
> (there is no good way to tell between them in the general case).
> 
> Windows uses a BOM character at the start of UTF-8 files to differentiate 
> (at least in programs like Notepad and the built-in edit control), but that 
> is not recommended by Unicode. I think they would prefer a world where 
> Latin-1 had disappeared completely, but that of course is not the real 
> world.
> 
> That's probably enough character set info to get you into trouble. ;-)

Mild trouble anyway, no burnings, no heresy trials. The ISO-10646 standard does favor using the correct BOM at the start of UTF-8, UCS-2 and UCS-4.  Unicode is an extended version of UCS-2 that includes pages other than the 10646 BMP (Basic Multilingual Plane).  Using a BOM with Unicode may mislead a program reading the file.  The problem is not telling Unicode from UCS-2 when they are different: there are no differences between Unicode and UCS-2 unless those extra pages are used.  Files in most languages will be identical.  Even Japanese and Chinese may not be detectable--unless you omit the BOM for Unicode files. ;-)

> > Really ?? You're sayin' there position such as Wide_Character'Val(X) 
> > doesn't correspond to the Xth character in the UNICODE standard ??

Whoo boy, digging a deep hole here. You have to keep in mind that there are at least three character sets that matter when you are programming in Ada (or any other language.)

First, there is the character set that you use to create the program.  The Ada standard provides a default, and it is the one that the compiler tests use. But it is only a default, and GNAT accepts source in different formats. Back when Ada was new, there were compilers for programs written in IBM's EBCDIC.

The second character set you care about (or set of them) is the Ada Character type, and the other character types.  In the IBM compiler above, Character corresponded to ASCII as expected.  The ordering of character literals was ASCII, not EBCDIC, etc.

The third group of character sets are those that correspond to printers, displays and keyboards.  If you need to write code that supports, say, Cyrillic terminals, you may end up with strings that are really in, say, Russian.  Best to gather them all in one "Language" package, to make it easier when you have to do Ukrainian. :-(

If all three character sets are the same, that's nice.  But it can lead to sloppy thinking.   Way back, when the ARG was wrestling with this, getting everyone on the same page about which of these character sets we were discussing at any given moment allowed us to get things into reasonable shape going into the Ada 9X development.  You want your compiler to allow Shift-JIS in comments?  Sure.  Just remember that an end of line, and only an end of line, terminates a comment.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 22:32 ` Mehdi Saada
  2017-12-27 22:33   ` Mehdi Saada
  2017-12-27 23:57   ` Randy Brukardt
@ 2017-12-28  9:04   ` Dmitry A. Kazakov
  2017-12-28 11:06     ` Niklas Holsti
  2 siblings, 1 reply; 26+ messages in thread
From: Dmitry A. Kazakov @ 2017-12-28  9:04 UTC (permalink / raw)


On 2017-12-27 23:32, Mehdi Saada wrote:

> Fundamentaly, how can a UTF8 string even represent  codepoints next to the 255th ??

UTF-8 uses a chain code to represent large integers. 7-bit ASCII is 
coded as-is. Other characters require more than one octet. It is a 
technique widely used in communication for lossless compression. The 
drawback is that you cannot directly index characters in a UTF-8 
string. But virtually no text-processing algorithm needs that, so it 
is no actual loss.

In short, the representation unit (octet) /= the represented thing (character).

> Superscripts and subscripts means more change in the IO package.
> Before I could simply use the generic Integer_IO, but I have no clue 
> how to do to output a specific code point for each digit in a
> specific  base... wouldn't that mean rewriting part of Integer_IO ?

You mean the standard library Integer_IO? Sure, you will have to replace it.

> I may have a rather very shallow understanding of characters
> encoding and representation, and that's quite an understatement, but
> you said: "Ada's Character has Latin-1 encoding which differs from
> UTF-8 in the  code positions greater than 127"
> Really ??

Yep. Latin-1 and UTF-8 have different representations. Both have 7-bit 
ASCII as a subset.

> You're sayin' there position such as Wide_Character'Val(X)
> doesn't correspond to the Xth character in the UNICODE standard ??

Character = Latin-1
Wide_Character = UCS-2
Wide_Wide_Character = UCS-4

Linux uses UTF-8 (for a long time). Windows uses either ASCII (so-called 
A-calls) or UTF-16 (so-called W-calls). There was a time, long ago, when 
Windows used UCS-2, but then they ditched it for UTF-16.

Now, Ada programmers insolently ignore the standard and pragmatically use:

Character = representation unit of UTF-8 (octet)
Wide_Character = representation unit of UTF-16
Wide_Wide_Character = UNICODE code point

This works most of the time, but one should be careful.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28  9:04   ` Dmitry A. Kazakov
@ 2017-12-28 11:06     ` Niklas Holsti
  2017-12-28 11:50       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 26+ messages in thread
From: Niklas Holsti @ 2017-12-28 11:06 UTC (permalink / raw)


On 17-12-28 11:04 , Dmitry A. Kazakov wrote:
> On 2017-12-27 23:32, Mehdi Saada wrote:
    [snip]
>> Superscripts and subscripts means more change in the IO package.
>> Before I could simply use the generic Integer_IO, but I have no clue
>> how to do to output a specific code point for each digit in a
>> specific  base... wouldn't that mean rewriting part of Integer_IO ?
>
> You mean the standard library Integer_IO? Sure, you will have to replace
> it.

It seems simpler to continue using Integer_IO, but to Put the number 
into a String, and then translate the digits in the resulting String 
into superscript or subscript form, as desired.

The translation for decimal digits 0..9 seems quite simple 
(https://en.wikipedia.org/wiki/Superscripts_and_Subscripts).
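
A rough sketch of that translation (untested; the function name is mine):

    --  Map decimal digits in an already-formatted image (e.g. the result
    --  of Integer_IO.Put into a String) to superscript code points.
    function To_Superscript (Item : String) return Wide_Wide_String is
       Result : Wide_Wide_String (Item'Range);
       Sup    : Wide_Wide_Character;
    begin
       for I in Item'Range loop
          case Item (I) is
             when '1' =>
                Sup := Wide_Wide_Character'Val (16#00B9#);
             when '2' =>
                Sup := Wide_Wide_Character'Val (16#00B2#);
             when '3' =>
                Sup := Wide_Wide_Character'Val (16#00B3#);
             when '0' | '4' .. '9' =>
                Sup := Wide_Wide_Character'Val
                         (16#2070# + Character'Pos (Item (I))
                                   - Character'Pos ('0'));
             when others =>  --  minus sign, spaces, ...
                Sup := Wide_Wide_Character'Val (Character'Pos (Item (I)));
          end case;
          Result (I) := Sup;
       end loop;
       return Result;
    end To_Superscript;

The result can then be encoded with Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode 
and written with plain Text_IO.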

Using the Unicode "fraction slash" seems less reliable, to judge from 
the hints in 
https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts: "Some 
browsers support this".

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 11:06     ` Niklas Holsti
@ 2017-12-28 11:50       ` Dmitry A. Kazakov
  0 siblings, 0 replies; 26+ messages in thread
From: Dmitry A. Kazakov @ 2017-12-28 11:50 UTC (permalink / raw)


On 2017-12-28 12:06, Niklas Holsti wrote:
> On 17-12-28 11:04 , Dmitry A. Kazakov wrote:
>> On 2017-12-27 23:32, Mehdi Saada wrote:
>     [snip]
>>> Superscripts and subscripts means more change in the IO package.
>>> Before I could simply use the generic Integer_IO, but I have no clue
>>> how to do to output a specific code point for each digit in a
>>> specific  base... wouldn't that mean rewriting part of Integer_IO ?
>>
>> You mean the standard library Integer_IO? Sure, you will have to replace
>> it.
> 
> It seems simpler to continue using Integer_IO, but to Put the number 
> into a String, and then translate the digits in the resulting String 
> into superscript or subscript form, as desired.

Translating an integer into decimal digits is arguably easier than 
converting the ASCII codes of decimal digits (and the sign) into UTF-8 
subscript or superscript chains of octets.

And the procedures and functions for sub-/superscript string I/O are 
ready. No need to rewrite them.

BTW, Integer_IO is quite uncomfortable to use with strings. This was the 
reason why I redesigned its interface as:

    procedure Put
              (  Destination : in out String;
                 Pointer     : in out Integer;
                 Value       : Number'Base;
                 Base        : NumberBase := 10;
                 PutPlus     : Boolean    := False;
                 Field       : Natural    := 0;
                 Justify     : Alignment  := Left;
                 Fill        : Character  := ' '
              );

instead of:

    procedure Put
              (  To   : out String;
                 Item : in Num;
                 Base : in Number_Base := Default_Base
              );

which requires trimming and thus has little advantage over plain Num'Image.
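
Roughly, the Pointer-based form is used like this (illustrative only, 
assuming an instance for Integer is visible; not copied from the 
package's documentation):

    declare
       Line    : String (1 .. 80);
       Pointer : Integer := Line'First;
    begin
       Put (Line, Pointer, 42, Base => 16);
       Put (Line, Pointer, 2017);
       --  Line (Line'First .. Pointer - 1) now holds both images, one
       --  after the other; the cursor advances, so no trimming is needed.
    end;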

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
  2017-12-27 20:04 ` Dmitry A. Kazakov
  2017-12-27 22:32 ` Mehdi Saada
@ 2017-12-28 13:15 ` Mehdi Saada
  2017-12-28 14:25   ` Dmitry A. Kazakov
  2017-12-28 22:36 ` Mehdi Saada
  3 siblings, 1 reply; 26+ messages in thread
From: Mehdi Saada @ 2017-12-28 13:15 UTC (permalink / raw)


OK, I'm done with it. It sure is interesting, but I don't want to even think about all this stuff for the time being... Talk about a "universal standard", when it's (apparently) far from universal or uniform!
> Easy: it uses a variable-width representation. 
That's under the assumption that terminals will be able to display it... well, whatever I use in the end, I have to assume that anyway.
If it doesn't look as nice as intended, I'll probably stick with Latin-1 and forget about the slash or whatnot.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 13:15 ` Mehdi Saada
@ 2017-12-28 14:25   ` Dmitry A. Kazakov
  2017-12-28 14:32     ` Simon Wright
  0 siblings, 1 reply; 26+ messages in thread
From: Dmitry A. Kazakov @ 2017-12-28 14:25 UTC (permalink / raw)


On 2017-12-28 14:15, Mehdi Saada wrote:

> Ok, I'm done with it. It sure is interesting, but I don't want to
> even think about all this stuff for the time being... Talk about
> "universal  standard", when it's (apparently) it's far from universal or uniform !

It is. Everybody uses UTF-8. Even under Windows. The text is converted 
from/to UTF-16 right after or before passing it to the system call. All 
processing is UTF-8. E.g. GTK uses UTF-8 consistently no matter what OS.

>> Easy: it uses a variable-width representation.
> Under the assumption terminals will be able to display it... well,
> whatever I use in the end, I've got to suppose it anyway.

Sure they are, on Linux and Windows.

Take this program:
------------------------------------
with Ada.Text_IO;  use Ada.Text_IO;
procedure Superscript is
begin
    Put_Line
    (  "Superscript 1="
    &  Character'Val (194)
    &  Character'Val (185)
    );
end Superscript;
------------------------------------
Start Windows console:

 > gnatmake superscript.adb
 > chcp 65001
 > superscript

This will, depending on the font, nicely output:

Superscript 1=¹

P.S. Batch command chcp selects the code page of the console. 65001 is 
for UTF-8.

P.P.S. Some Windows fonts do not have sub-/superscript glyphs, so you 
might wish to set the console to Lucida Console or some other fixed-width 
font with Unicode support.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 14:25   ` Dmitry A. Kazakov
@ 2017-12-28 14:32     ` Simon Wright
  2017-12-28 15:28       ` Niklas Holsti
  0 siblings, 1 reply; 26+ messages in thread
From: Simon Wright @ 2017-12-28 14:32 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

> On 2017-12-28 14:15, Mehdi Saada wrote:
>
>> Ok, I'm done with it. It sure is interesting, but I don't want to
>> even think about all this stuff for the time being... Talk about
>> "universal standard", when it's (apparently) it's far from universal
>> or uniform !
>
> It is. Everybody uses UTF-8. Even under Windows. The text is converted
> from/to UTF-16 right after or before passing it to the system
> call. All processing is UTF-8. E.g. GTK uses UTF-8 consistently no
> matter what OS.
>
>>> Easy: it uses a variable-width representation.
>> Under the assumption terminals will be able to display it... well,
>> whatever I use in the end, I've got to suppose it anyway.
>
> Sure they are Linux and Windows.
>
> Take this program:
> ------------------------------------
> with Ada.Text_IO;  use Ada.Text_IO;
> procedure Superscript is
> begin
>    Put_Line
>    (  "Superscript 1="
>    &  Character'Val (194)
>    &  Character'Val (185)
>    );
> end Superscript;
> ------------------------------------

works fine on macOS (no chcp messing needed!)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 14:32     ` Simon Wright
@ 2017-12-28 15:28       ` Niklas Holsti
  2017-12-28 15:47         ` 00120260b
  2017-12-28 18:15         ` Simon Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Niklas Holsti @ 2017-12-28 15:28 UTC (permalink / raw)


On 17-12-28 16:32 , Simon Wright wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>
>> On 2017-12-28 14:15, Mehdi Saada wrote:
>>
>>> Ok, I'm done with it. It sure is interesting, but I don't want to
>>> even think about all this stuff for the time being... Talk about
>>> "universal standard", when it's (apparently) it's far from universal
>>> or uniform !
>>
>> It is. Everybody uses UTF-8. Even under Windows. The text is converted
>> from/to UTF-16 right after or before passing it to the system
>> call. All processing is UTF-8. E.g. GTK uses UTF-8 consistently no
>> matter what OS.
>>
>>>> Easy: it uses a variable-width representation.
>>> Under the assumption terminals will be able to display it... well,
>>> whatever I use in the end, I've got to suppose it anyway.
>>
>> Sure they are Linux and Windows.
>>
>> Take this program:
>> ------------------------------------
>> with Ada.Text_IO;  use Ada.Text_IO;
>> procedure Superscript is
>> begin
>>    Put_Line
>>    (  "Superscript 1="
>>    &  Character'Val (194)
>>    &  Character'Val (185)
>>    );
>> end Superscript;
>> ------------------------------------
>
> works fine on macOS (no chcp messing needed!)

Depends on the Preferences (-> Settings -> Advanced: Character encoding) 
you set for the Mac Terminal program. While UTF-8 is one of the 
available encodings, I normally have it set to Latin-1, to match Ada and 
GNAT. Latin-1 is fine for the languages I mostly use (English, Swedish, 
Finnish).

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 15:28       ` Niklas Holsti
@ 2017-12-28 15:47         ` 00120260b
  2017-12-28 22:35           ` G.B.
  2017-12-28 18:15         ` Simon Wright
  1 sibling, 1 reply; 26+ messages in thread
From: 00120260b @ 2017-12-28 15:47 UTC (permalink / raw)


Then, how come the standard hasn't made it a bit easier to input/output post-Latin-1 characters? Why aren't the other standards / character sets / encodings treated more like special cases?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 15:28       ` Niklas Holsti
  2017-12-28 15:47         ` 00120260b
@ 2017-12-28 18:15         ` Simon Wright
  1 sibling, 0 replies; 26+ messages in thread
From: Simon Wright @ 2017-12-28 18:15 UTC (permalink / raw)


Niklas Holsti <niklas.holsti@tidorum.invalid> writes:

> On 17-12-28 16:32 , Simon Wright wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:

>>> Take this program:
>>> ------------------------------------
>>> with Ada.Text_IO;  use Ada.Text_IO;
>>> procedure Superscript is
>>> begin
>>>    Put_Line
>>>    (  "Superscript 1="
>>>    &  Character'Val (194)
>>>    &  Character'Val (185)
>>>    );
>>> end Superscript;
>>> ------------------------------------
>>
>> works fine on macOS (no chcp messing needed!)
>
> Depends on the Preferences (-> Settings -> Advanced: Character
> encoding) you set for the Mac Terminal program. While UTF-8 is one of
> the available encodings, I normally have it set to Latin-1, to match
> Ada and GNAT. Latin-1 is fine for the languages I mostly use (English,
> Swedish, Finnish).

I dare say you can do something similar under Linux?

The setting is in Preferences -> Profiles -> Advanced on High Sierra,
and Unicode (UTF-8) is what I have set; I don't recall changing it, so I
guess it's the default.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 15:47         ` 00120260b
@ 2017-12-28 22:35           ` G.B.
  0 siblings, 0 replies; 26+ messages in thread
From: G.B. @ 2017-12-28 22:35 UTC (permalink / raw)


On 28.12.17 16:47, 00120260b@gmail.com wrote:
> Then, how come the norm hasn't made it a bit easier to input/ouput post-latin-1 characters ? Why aren't other norms/characters set/encodings more like special cases ?
> 

Actually, output of non-7-bit, unambiguously encoded text
has been made reasonably easy, I'd say, also defaulting
to what should be expected:

with Ada.Wide_Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure UTF is
    --  USD/EUR, i.e. "$/€"
    Ratio : constant Wide_String := "$/" & Wide_Character'Val (16#20AC#);

    use Ada.Wide_Text_IO, Ada.Strings;
begin
    Put_Line (Ratio); --  use defaults, traditional
    String'Write --  stream output, force UTF-8
      (Text_Streams.Stream (Current_Output),
       UTF_Encoding.Wide_Strings.Encode (Ratio));
end UTF;

The above source text uses only 7-bit encoding for the post-Latin-1
string; only the comment contains a non-ASCII character.

If, instead, the source text is encoded with "more" bits, and uses
post-Latin-1 literals or identifiers, then the compiler may need to
be told. I think BOMs may be of use, and in any case there are
compiler switches or some other vendor-specific vocabulary for
describing the source text encoding.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
                   ` (2 preceding siblings ...)
  2017-12-28 13:15 ` Mehdi Saada
@ 2017-12-28 22:36 ` Mehdi Saada
  2017-12-29  0:51   ` Randy Brukardt
  2017-12-30 12:50   ` Björn Lundin
  3 siblings, 2 replies; 26+ messages in thread
From: Mehdi Saada @ 2017-12-28 22:36 UTC (permalink / raw)


I took some time to read here and there on the topics of encoding, character sets, Unicode, what UTF-8, UTF-16 and UTF-32 are, little and big endian, BOMs, etc.
Now that I've done that, your comments, Dmitry, sound accurate, and it turned out I really knew nothing about characters/glyphs/code points.
It wasn't so complicated in the end. I'll look at your work soon. Since I long to work in the area of interfaces and command-line utilities, the sooner I learn all about characters, the better. Thanks for your explanations, you guys ;-)

Myself:
> there are positions such as Wide_Character'Val(X) that don't correspond to the Xth character in the UNICODE standard ??
Of course: Character'Val(156) to 'Val(255) are one byte long, whereas in UTF-8 the corresponding code points are encoded with two bytes. Did I understand the lesson?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 22:36 ` Mehdi Saada
@ 2017-12-29  0:51   ` Randy Brukardt
  2017-12-30 12:50   ` Björn Lundin
  1 sibling, 0 replies; 26+ messages in thread
From: Randy Brukardt @ 2017-12-29  0:51 UTC (permalink / raw)


"Mehdi Saada" <00120260a@gmail.com> wrote in message 
news:023dc29b-dbc5-4fc8-b44f-d748517adec8@googlegroups.com...
...
> Myself:
>> there are positions such as Wide_Character'Val(X) doesn't correspond to 
>> the Xth character in the UNICODE standard ??
> Of course: Character'val(156) to 'val(255) are one byte long, whereas in 
> UTF8 the corresponding code points are encoded with two bytes. Did I 
> understood the lesson ?

Yup, that's right. And whether UTF-8 is recognized depends on what the 
display device does with it. If you don't include Dmitry's chcp command 
on Windows, most likely his program will output garbage. (On my computer, 
the default code page is 437, which certainly won't display UTF-8 strings!)

                                          Randy.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28 22:36 ` Mehdi Saada
  2017-12-29  0:51   ` Randy Brukardt
@ 2017-12-30 12:50   ` Björn Lundin
  2017-12-30 15:33     ` Dennis Lee Bieber
  1 sibling, 1 reply; 26+ messages in thread
From: Björn Lundin @ 2017-12-30 12:50 UTC (permalink / raw)


On 2017-12-28 23:36, Mehdi Saada wrote:
> Myself:
>> there are positions such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ??
> Of course: Character'val(156) to 'val(255) are one byte long, whereas in UTF8 the corresponding code points are encoded with two bytes. Did I understood the lesson ?

Yes - if it fits into 2 bytes; if not, UTF-8 uses 3 or 4 bytes instead.
So UTF-8 can encode code points up to 32 bits (ca 4 billion).

codepoint between
1     -> 2**8  -1 = 1 byte
2**8  -> 2**16 -1 = 2 bytes
2**16 -> 2**24 -1 = 3 bytes
2**24 -> 2**32 -1 = 4 bytes

-- 
--
Björn

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-30 12:50   ` Björn Lundin
@ 2017-12-30 15:33     ` Dennis Lee Bieber
  2017-12-30 15:56       ` Dmitry A. Kazakov
  2017-12-30 23:20       ` Björn Lundin
  0 siblings, 2 replies; 26+ messages in thread
From: Dennis Lee Bieber @ 2017-12-30 15:33 UTC (permalink / raw)


On Sat, 30 Dec 2017 13:50:37 +0100, Björn Lundin <b.f.lundin@gmail.com>
declaimed the following:

>On 2017-12-28 23:36, Mehdi Saada wrote:
>> Myself:
>>> there are positions such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ??
>> Of course: Character'val(156) to 'val(255) are one byte long, whereas in UTF8 the corresponding code points are encoded with two bytes. Did I understood the lesson ?
>
>Yes - if it fits into 2 bytes. if not UTF-8 uses 3 and 4 bytes instead.
>So UTF-8 can use codepoints up to 32 bits (ca 4 billion)
>
>codepoint between
>1     -> 2**8  -1 = 1 byte

	Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and
number of MSB bits set before a 0 bit indicates how many bytes).

>2**8  -> 2**16 -1 = 2 bytes
>2**16 -> 2**24 -1 = 3 bytes
>2**24 -> 2**32 -1 = 4 bytes
>
>-- 
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-30 15:33     ` Dennis Lee Bieber
@ 2017-12-30 15:56       ` Dmitry A. Kazakov
  2017-12-30 23:20       ` Björn Lundin
  1 sibling, 0 replies; 26+ messages in thread
From: Dmitry A. Kazakov @ 2017-12-30 15:56 UTC (permalink / raw)


On 2017-12-30 16:33, Dennis Lee Bieber wrote:

> 	Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and
> number of MSB bits set before a 0 bit indicates how many bytes).

Yes. Furthermore, the continuation octets have the MSB set as well. The 
reason for this "waste" is to allow bidirectional scanning of UTF-8 strings.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-30 15:33     ` Dennis Lee Bieber
  2017-12-30 15:56       ` Dmitry A. Kazakov
@ 2017-12-30 23:20       ` Björn Lundin
  1 sibling, 0 replies; 26+ messages in thread
From: Björn Lundin @ 2017-12-30 23:20 UTC (permalink / raw)


On 2017-12-30 16:33, Dennis Lee Bieber wrote:
>> codepoint between
>> 1     -> 2**8  -1 = 1 byte
> 	Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and
> number of MSB bits set before a 0 bit indicates how many bytes).
> 
>> 2**8  -> 2**16 -1 = 2 bytes
>> 2**16 -> 2**24 -1 = 3 bytes
>> 2**24 -> 2**32 -1 = 4 bytes

You are probably right;
I meant to point out the principle:
that UTF-8 can be more than 2 bytes,
and that it expands as needed up to 4 bytes.

-- 
--
Björn


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: unicode and wide_text_io
  2017-12-28  5:20     ` Robert Eachus
@ 2017-12-31 21:41       ` Keith Thompson
  0 siblings, 0 replies; 26+ messages in thread
From: Keith Thompson @ 2017-12-31 21:41 UTC (permalink / raw)


Robert Eachus <rieachus@comcast.net> writes:
> On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote:
>> "Mehdi Saada" <00120260a@gmail.com> wrote in message 
>> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com...
>> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably
>> >> meant output of code points. That is a different beast. Convert a code
>> >> point to UTF-8 string and output that. E.g.
>> > Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string 
>> > even represent
>> > codepoints next to the 255th ??
>> 
>> Easy: it uses a variable-width representation.
>> 
>> > I may have a rather very shallow understanding of characters encoding and 
>> > representation,
>> 
>> That's the problem. Unless you can stick to Latin-1, you'll need to fix that 
>> understanding before contining.
>> 
>> In Ada,  type Character = Latin-1 = first 255 code positions, 8-bit 
>> representation. Text_IO and type String are for Latin-1 strings.
>> 
>> type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code 
>> positions = UCS-2 = 16-bit representation.
>
> There is also UTF16 which is identical to Unicode, characters in the
> range 0D800 to 0DFFF are used as escapes to allow more than 65536
> code-points.

Unicode specifies code points, numeric values for each of a large number
of characters.  UTF-8, UTF-16, and UTF-32/UCS-4 are *representations* of
Unicode.  They're all able to represent all Unicode characters, and they
differ in how they do so.  (ASCII, Latin-1, and UCS-2 are
representations of small subsets of Unicode.)

>> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation.
>
> No, all of UCS-4, everything defined in ISO-10646.

What are you saying "No" to?

>> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a 
>> conversion package (Ada.Strings.Encoding) [which is nasty because it breaks 
>> strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and 
>> Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 
>> (there is no good way to tell between them in the general case).
>> 
>> Windows uses a BOM character at the start of UTF-8 files to differentiate 
>> (at least in programs like Notepad and the built-in edit control), but that 
>> is not recommended by Unicode. I think they would prefer a world where 
>> Latin-1 had disappeared completely, but that of course is not the real 
>> world.
>> 
>> That's probably enough character set info to get you into trouble. ;-)
>
> Mild trouble anyway, no burnings, no heresy trials. The ISO-10646
> standard does favor using the correct BOM at the start of UTF-8, UCS-2
> and UCS-4.  Unicode is an extended version of UCS-2 to include pages
> other than the 10646 BMP (Basic multilingual plane).  Using a BOM with
> Unicode may mislead a program reading the file.  The problem is not
> telling Unicode from UCS-2 when they are different. There no
> differences between Unicode and UCS-2 and unless those extra pages are
> used.  Files in most languages will be identical.  Even Japanese and
> Chinese may not be detectable--unless you omit the BOM for Unicode
> files. ;-)

The above is correct if you replace "Unicode" by "UTF-16".  UCS-2
uses 2 bytes per character, with no mechanism for representing code
points above 65535.  UTF-16 is based on UCS-2, with a mechanism for
using multiple 2-byte sequences to represent code points above 65535.
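
The surrogate arithmetic itself is short; an untested sketch (the procedure 
name is mine):

    --  Split a code point above 16#FFFF# into a UTF-16 surrogate pair.
    procedure To_Surrogates
      (Code_Point : in  Natural;          --  16#1_0000# .. 16#10_FFFF#
       High, Low  : out Wide_Character)
    is
       Offset : constant Natural := Code_Point - 16#1_0000#;
    begin
       High := Wide_Character'Val (16#D800# + Offset / 16#400#);
       Low  := Wide_Character'Val (16#DC00# + Offset mod 16#400#);
    end To_Surrogates;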

(In Windows, it's common to refer to Windows-1252 as "ANSI"
and UTF-16 as "Unicode".  Both are incorrect.  Windows-1252 was
submitted to ANSI for standardization, but was never approved.
UTF-16 is a representation of Unicode.)

I don't know what ISO-10646 recommends, but using a BOM with UTF-8
files causes problems on Unix-like systems.  On such systems,
most text files these days are UTF-8 and most do not have a BOM
(because it's not needed; BOM is a byte order mark, and UTF-8 has
no variations in byte ordering).

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-12-31 21:41 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
2017-12-27 20:04 ` Dmitry A. Kazakov
2017-12-27 21:47   ` Dennis Lee Bieber
2017-12-27 22:32 ` Mehdi Saada
2017-12-27 22:33   ` Mehdi Saada
2017-12-27 22:48     ` Mehdi Saada
2017-12-27 23:32       ` Mehdi Saada
2017-12-27 23:57   ` Randy Brukardt
2017-12-28  5:20     ` Robert Eachus
2017-12-31 21:41       ` Keith Thompson
2017-12-28  9:04   ` Dmitry A. Kazakov
2017-12-28 11:06     ` Niklas Holsti
2017-12-28 11:50       ` Dmitry A. Kazakov
2017-12-28 13:15 ` Mehdi Saada
2017-12-28 14:25   ` Dmitry A. Kazakov
2017-12-28 14:32     ` Simon Wright
2017-12-28 15:28       ` Niklas Holsti
2017-12-28 15:47         ` 00120260b
2017-12-28 22:35           ` G.B.
2017-12-28 18:15         ` Simon Wright
2017-12-28 22:36 ` Mehdi Saada
2017-12-29  0:51   ` Randy Brukardt
2017-12-30 12:50   ` Björn Lundin
2017-12-30 15:33     ` Dennis Lee Bieber
2017-12-30 15:56       ` Dmitry A. Kazakov
2017-12-30 23:20       ` Björn Lundin
