* unicode and wide_text_io @ 2017-12-27 18:08 Mehdi Saada 2017-12-27 20:04 ` Dmitry A. Kazakov ` (3 more replies) 0 siblings, 4 replies; 26+ messages in thread From: Mehdi Saada @ 2017-12-27 18:08 UTC (permalink / raw) I would like to avoid rewriting an I/O-related package, which would prove tiresome in the end. As it is, it uses UTF-8 (so TEXT_IO), but for ONE, only ONE character, I need the "put" of WIDE_TEXT_IO. The fraction slash character ⁄ would allow better-looking fractions when outputting rationals, since it is meant to tell the terminal to render the numbers before and after it as superscript and subscript, respectively. Is there a way in Unicode, within UTF-8, to shift outside of UTF-8 ? I doubt it, and saying it like this sounds self-contradictory, but that would be fun, so ... ? Or else, do I write a "put" for screen output only ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 18:08 unicode and wide_text_io Mehdi Saada @ 2017-12-27 20:04 ` Dmitry A. Kazakov 2017-12-27 21:47 ` Dennis Lee Bieber 2017-12-27 22:32 ` Mehdi Saada ` (2 subsequent siblings) 3 siblings, 1 reply; 26+ messages in thread From: Dmitry A. Kazakov @ 2017-12-27 20:04 UTC (permalink / raw) On 2017-12-27 19:08, Mehdi Saada wrote: > I would like to avoid rewriting an I/O-related package, which would > prove tiresome in the end. As it is, it uses UTF-8 (so TEXT_IO), Ada.Text_IO is Latin-1, at least formally. Use Stream I/O instead if you don't want surprises. > but for ONE, only ONE character, I need the "put" of WIDE_TEXT_IO. No, you don't. Wide_Text_IO is UCS-2. Keep on using UTF-8. You probably meant output of code points. That is a different beast. Convert a code point to a UTF-8 string and output that. E.g.: function Image (Value : UTF8_Code_Point) return String; here http://www.dmitry-kazakov.de/ada/strings_edit.htm#Strings_Edit.UTF8 For example: Image (16#F8D0#) & Image (16#F8D3#) & Image (16#F8D0#) would be "ADA" in Klingon. They seem not to know that the proper spelling is "Ada", but what would you expect from them? (:-)) > The slash character ⁄ would allow better-looking fractions when > outputting rationals, since it is meant to tell the terminal to > render the numbers before and after it as superscript and subscript, > respectively. Why don't you simply output super- or subscript digits in UTF-8? http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.3 Use Image (Number) from the package instance. That is all. > Is there a way in Unicode, within UTF-8, to shift outside of UTF-8 ? I don't understand the meaning of this question. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
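Dmitry's advice above — convert the code point to a UTF-8 octet sequence and output it through plain Text_IO — can be sketched in standard Ada without his Strings_Edit package. This is an independent sketch of the UTF-8 encoding rules, not the Strings_Edit implementation:

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Encode_Demo is

   --  Encode a single Unicode code point as a UTF-8 octet sequence,
   --  held in a plain String so Latin-1-oriented Text_IO can write it.
   function To_UTF_8 (Code : Natural) return String is
   begin
      if Code < 16#80# then                       -- 1 octet: 0xxxxxxx
         return (1 => Character'Val (Code));
      elsif Code < 16#800# then                   -- 2 octets: 110xxxxx 10xxxxxx
         return Character'Val (16#C0# + Code / 2**6)
              & Character'Val (16#80# + Code mod 2**6);
      elsif Code < 16#1_0000# then                -- 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
         return Character'Val (16#E0# + Code / 2**12)
              & Character'Val (16#80# + (Code / 2**6) mod 2**6)
              & Character'Val (16#80# + Code mod 2**6);
      else                                        -- 4 octets, up to U+10FFFF
         return Character'Val (16#F0# + Code / 2**18)
              & Character'Val (16#80# + (Code / 2**12) mod 2**6)
              & Character'Val (16#80# + (Code / 2**6) mod 2**6)
              & Character'Val (16#80# + Code mod 2**6);
      end if;
   end To_UTF_8;

begin
   --  U+2044 FRACTION SLASH encodes as the three octets E2 81 84.
   Put_Line ("1" & To_UTF_8 (16#2044#) & "2");
end Encode_Demo;
```

On a UTF-8 terminal this prints 1⁄2, which is exactly the "only one character" the OP needs, with no Wide_Text_IO involved.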
* Re: unicode and wide_text_io 2017-12-27 20:04 ` Dmitry A. Kazakov @ 2017-12-27 21:47 ` Dennis Lee Bieber 0 siblings, 0 replies; 26+ messages in thread From: Dennis Lee Bieber @ 2017-12-27 21:47 UTC (permalink / raw) On Wed, 27 Dec 2017 21:04:26 +0100, "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> declaimed the following: >On 2017-12-27 19:08, Mehdi Saada wrote: > >> The slash character ? would allow better looking fractions for >> outputting rationnals, since it's meant to tell the terminal to >> consider numbersbefore and after as superscript and subscript, >> respectively. > >Why don't you simply output super- or subscript digits in UTF-8? > Given the OP's phrasing, it almost sounds like they are trying to send a terminal specific control sequence in which the terminal somehow performs super/sub scripting on the "numbers" surrounding that sequence. Not a feature I recall ever seeing on a terminal... Not on VT100/ANSI controls, at least. -- Wulfraed Dennis Lee Bieber AF6VN wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/ ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 18:08 unicode and wide_text_io Mehdi Saada 2017-12-27 20:04 ` Dmitry A. Kazakov @ 2017-12-27 22:32 ` Mehdi Saada 2017-12-27 22:33 ` Mehdi Saada ` (2 more replies) 2017-12-28 13:15 ` Mehdi Saada 2017-12-28 22:36 ` Mehdi Saada 3 siblings, 3 replies; 26+ messages in thread From: Mehdi Saada @ 2017-12-27 22:32 UTC (permalink / raw) > Wide Text_IO is UCS-2. Keep on using UTF-8. You probably > meant output of code points. That is a different beast. Convert a code > point to UTF-8 string and output that. E.g. Sure, I'll look at your work, but ... Fundamentally, how can a UTF-8 string even represent code points beyond the 255th ?? Superscripts and subscripts mean more changes in the I/O package. Before, I could simply use the generic Integer_IO, but I have no clue how to output a specific code point for each digit in a specific base... wouldn't that mean rewriting part of Integer_IO ? I may have a rather very shallow understanding of character encoding and representation, and that's quite an understatement, but you said: "Ada's Character has Latin-1 encoding which differs from UTF-8 in the code positions greater than 127" Really ?? You're sayin' that a position such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ?? And I know peanuts about the UCS-2 thing. I'm too ignorant to get one bit of your saying, except it sounds like heresy in the ears of the Ada Church. Burn them all !! Ada.Streams permits output of bytes without any formatting, right ? If so, it might do. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 22:32 ` Mehdi Saada @ 2017-12-27 22:33 ` Mehdi Saada 2017-12-27 22:48 ` Mehdi Saada 2017-12-27 23:57 ` Randy Brukardt 2017-12-28 9:04 ` Dmitry A. Kazakov 2 siblings, 1 reply; 26+ messages in thread From: Mehdi Saada @ 2017-12-27 22:33 UTC (permalink / raw) Le mercredi 27 décembre 2017 23:32:52 UTC+1, Mehdi Saada a écrit : > > Wide Text_IO is UCS-2. Keep on using UTF-8. You probably > > meant output of code points. That is a different beast. Convert a code > > point to UTF-8 string and output that. E.g. > Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string even represent codepoints next to the 255th ?? > Superscripts and subscripts means more change in the IO package. > Before I could simply use the generic Integer_IO, but I have no clue how to do to output a specific code point for each digit in a specific base... wouldn't that mean rewriting part of Integer_IO ? > > I may have a rather very shallow understanding of characters encoding and representation, and that's quite an understatement, but you said: "Ada's Character has Latin-1 encoding which differs from UTF-8 in the code positions greater than 127" > Really ?? You're sayin' there position such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ?? > And I know peanuts about the UCS-2 thing. I'm too ignorant for getting one bit of your saying, except it sounds like heresy in the ears of the Ada Church. Burn them all !! > Ada.stream permits output of bytes without any formatting, right ? I never studied streams for now. Sounds too early. But I'll look at it. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 22:33 ` Mehdi Saada @ 2017-12-27 22:48 ` Mehdi Saada 2017-12-27 23:32 ` Mehdi Saada 0 siblings, 1 reply; 26+ messages in thread From: Mehdi Saada @ 2017-12-27 22:48 UTC (permalink / raw) Let me put it another way: you're speaking Chinese here ^_^. I've looked at streams in the RM; I understand nothing. Way too early. Plus, wouldn't it be idiotic of me to rely on someone else's package, if the objective was to understand the ins and outs of my work ? > Is there a way in unicode in UTF8 to shift outside of UTF8 ? Means: to output characters in the Unicode standard past the 255th code point, while staying with Ada's String type. How the heck can I easily output a "slash" character ? If I go with the subscripts/superscripts, I'll have to rewrite the whole I/O package, which is a lot of work, and boring work at that. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 22:48 ` Mehdi Saada @ 2017-12-27 23:32 ` Mehdi Saada 0 siblings, 0 replies; 26+ messages in thread From: Mehdi Saada @ 2017-12-27 23:32 UTC (permalink / raw) I finally used ADA.WIDE_TEXT_IO for just the PUT procedure:

   procedure Put
     (Fichier : in WIDE_TEXT_IO.FILE_TYPE;
      Item    : in T_Rationnel)
   is
   begin -- Put
      P_Entier_wide.Put (File => Fichier, Item => Numer (Item), Width => 1);
      if Denom (Item) /= 1 then
         WIDE_TEXT_IO.Put
           (File => Fichier,
            Item => WIDE_CHARACTER'Val (16#2044#));
         --  U+2044 FRACTION SLASH; note the position must be given in
         --  hexadecimal (16#2044# = 8260), not as the decimal literal 2044.
         P_Entier_wide.Put (File => Fichier, Item => Denom (Item), Width => 1);
      end if;
   end Put;

   procedure Put (Item : in T_Rationnel) is
   begin -- Put
      Put (Fichier => WIDE_TEXT_IO.Standard_Output, Item => Item);
   end Put;

Why would that be wrong, Dmitry ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 22:32 ` Mehdi Saada 2017-12-27 22:33 ` Mehdi Saada @ 2017-12-27 23:57 ` Randy Brukardt 2017-12-28 5:20 ` Robert Eachus 2017-12-28 9:04 ` Dmitry A. Kazakov 2 siblings, 1 reply; 26+ messages in thread From: Randy Brukardt @ 2017-12-27 23:57 UTC (permalink / raw) "Mehdi Saada" <00120260a@gmail.com> wrote in message news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com... >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably >> meant output of code points. That is a different beast. Convert a code >> point to UTF-8 string and output that. E.g. > Sure I'll look to your work, but ... Fundamentally, how can a UTF-8 string > even represent > code points beyond the 255th ?? Easy: it uses a variable-width representation. > I may have a rather very shallow understanding of characters encoding and > representation, That's the problem. Unless you can stick to Latin-1, you'll need to fix that understanding before continuing. In Ada, type Character = Latin-1 = first 256 code positions, 8-bit representation. Text_IO and type String are for Latin-1 strings. type Wide_Character = BMP (Basic Multilingual Plane) = first 65536 code positions = UCS-2 = 16-bit representation. type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. There is no native support in Ada for UTF-8 or UTF-16 strings. There is a conversion package (Ada.Strings.UTF_Encoding) [which is nasty because it breaks strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and Wide_Text_IO. But you have to know if you are reading UTF-8 or Latin-1 (there is no good way to tell between them in the general case).
That's probably enough character set info to get you into trouble. ;-) Randy. > and that's quite an understatement, but you said: "Ada's Character has > Latin-1 encoding which differs from UTF-8 in the code positions greater than 127" > Really ?? You're sayin' there position such as Wide_Character'Val(X) > doesn't correspond to the Xth character in the UNICODE standard ?? > And I know peanuts about the UCS-2 thing. I'm too ignorant for getting one > bit of your saying, except it sounds like heresy in the ears of the Ada > Church. Burn them all !! > Ada.stream permits output of bits without any formatting, right ? If so, > it might do. ^ permalink raw reply [flat|nested] 26+ messages in thread
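The conversion package Randy refers to is named Ada.Strings.UTF_Encoding in Ada 2012. A minimal sketch of using it for the OP's fraction slash, assuming a terminal set to UTF-8:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Fraction_Demo is
   use Ada.Strings.UTF_Encoding;

   --  "1⁄2" with U+2044 FRACTION SLASH, held as a Wide_Wide_String
   --  (one position per code point, no encoding concerns yet).
   Half : constant Wide_Wide_String :=
     "1" & Wide_Wide_Character'Val (16#2044#) & "2";
begin
   --  Encode yields a UTF_8_String, which is a subtype of String, so
   --  the UTF-8 octets pass straight through Latin-1-oriented Text_IO.
   Ada.Text_IO.Put_Line (Wide_Wide_Strings.Encode (Half));
end Fraction_Demo;
```

This is the "breaks strong typing" point Randy makes: UTF_8_String is just String, so nothing stops you from mixing a UTF-8-encoded String with a Latin-1 one.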
* Re: unicode and wide_text_io 2017-12-27 23:57 ` Randy Brukardt @ 2017-12-28 5:20 ` Robert Eachus 2017-12-31 21:41 ` Keith Thompson 0 siblings, 1 reply; 26+ messages in thread From: Robert Eachus @ 2017-12-28 5:20 UTC (permalink / raw) On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote: > "Mehdi Saada" <00120260a@gmail.com> wrote in message > news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com... > >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably > >> meant output of code points. That is a different beast. Convert a code > >> point to UTF-8 string and output that. E.g. > > Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string > > even represent > > codepoints next to the 255th ?? > > Easy: it uses a variable-width representation. > > > I may have a rather very shallow understanding of characters encoding and > > representation, > > That's the problem. Unless you can stick to Latin-1, you'll need to fix that > understanding before contining. > > In Ada, type Character = Latin-1 = first 255 code positions, 8-bit > representation. Text_IO and type String are for Latin-1 strings. > > type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code > positions = UCS-2 = 16-bit representation. There is also UTF16 which is identical to Unicode, characters in the range 0D800 to 0DFFF are used as escapes to allow more than 65536 code-points. > > type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. No, all of UCS-4, everything defined in ISO-10646. > > There is no native support in Ada for UTF-8 or UTF-16 strings. There is a > conversion package (Ada.Strings.Encoding) [which is nasty because it breaks > strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and > Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 > (there is no good way to tell between them in the general case). 
> > Windows uses a BOM character at the start of UTF-8 files to differentiate > (at least in programs like Notepad and the built-in edit control), but that > is not recommended by Unicode. I think they would prefer a world where > Latin-1 had disappeared completely, but that of course is not the real > world. > > That's probably enough character set info to get you into trouble. ;-) Mild trouble anyway, no burnings, no heresy trials. The ISO-10646 standard does favor using the correct BOM at the start of UTF-8, UCS-2 and UCS-4. Unicode is an extended version of UCS-2 to include pages other than the 10646 BMP (Basic multilingual plane). Using a BOM with Unicode may mislead a program reading the file. The problem is telling Unicode from UCS-2 when they are different. There are no differences between Unicode and UCS-2 unless those extra pages are used. Files in most languages will be identical. Even Japanese and Chinese may not be detectable--unless you omit the BOM for Unicode files. ;-) > > Really ?? You're sayin' there position such as Wide_Character'Val(X) > > doesn't correspond to the Xth character in the UNICODE standard ?? Whoo boy, digging a deep hole here. You have to keep in mind that there are at least three character sets that matter when you are programming in Ada (or any other language). First, there is the character set that you use to create the program. The Ada standard provides a default, and it is the one that the compiler tests use. But it is only a default, and GNAT accepts source in different formats. Back when Ada was new, there were compilers for programs written in IBM's EBCDIC. The second character set you care about (or set of them) are the Ada Character type, and other character types. In the IBM compiler above, Character corresponded to ASCII as expected. The ordering of character literals was ASCII, not EBCDIC, etc. The third group of character sets are those that correspond to printers, displays and keyboards.
If you need to write code that supports, say Cyrillic terminals, you may end up with strings that are really in say Russian. Best to gather them all in one "Language" package, to make it easier when you have to do Ukrainian. :-( If all three character sets are the same, that's nice. But it can lead to sloppy thinking. Way back when the ARG was wrestling with this, getting everyone on the same page about which set of character sets we were discussing now, allowed us to get things into reasonable shape going into the Ada 9X development. You want your compiler to allow Shift-JIS in comments? Sure. Just remember that an end of line, and only an end of line terminates a comment. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 5:20 ` Robert Eachus @ 2017-12-31 21:41 ` Keith Thompson 0 siblings, 0 replies; 26+ messages in thread From: Keith Thompson @ 2017-12-31 21:41 UTC (permalink / raw) Robert Eachus <rieachus@comcast.net> writes: > On Wednesday, December 27, 2017 at 6:58:01 PM UTC-5, Randy Brukardt wrote: >> "Mehdi Saada" <00120260a@gmail.com> wrote in message >> news:a4b2d34b-e428-4f30-8fa2-eb06816c72b6@googlegroups.com... >> >> Wide Text_IO is UCS-2. Keep on using UTF-8. You probably >> >> meant output of code points. That is a different beast. Convert a code >> >> point to UTF-8 string and output that. E.g. >> > Sure I'll look to your work, but ... Fundamentaly, how can a UTF8 string >> > even represent >> > codepoints next to the 255th ?? >> >> Easy: it uses a variable-width representation. >> >> > I may have a rather very shallow understanding of characters encoding and >> > representation, >> >> That's the problem. Unless you can stick to Latin-1, you'll need to fix that >> understanding before contining. >> >> In Ada, type Character = Latin-1 = first 255 code positions, 8-bit >> representation. Text_IO and type String are for Latin-1 strings. >> >> type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code >> positions = UCS-2 = 16-bit representation. > > There is also UTF16 which is identical to Unicode, characters in the > range 0D800 to 0DFFF are used as escapes to allow more than 65536 > code-points. Unicode specifies code points, numeric values for each of a large number of characters. UTF-8, UTF-16, and UTF-32/UCS-4 are *representations* of Unicode. They're all able to represent all Unicode characters, and they differ in how they do so. (ASCII, Latin-1, and UCS-2 are representations of small subsets of Unicode.) >> type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. > > No, all of UCS-4, everything defined in ISO-10646. What are you saying "No" to? 
>> There is no native support in Ada for UTF-8 or UTF-16 strings. There is a >> conversion package (Ada.Strings.Encoding) [which is nasty because it breaks >> strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and >> Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 >> (there is no good way to tell between them in the general case). >> >> Windows uses a BOM character at the start of UTF-8 files to differentiate >> (at least in programs like Notepad and the built-in edit control), but that >> is not recommended by Unicode. I think they would prefer a world where >> Latin-1 had disappeared completely, but that of course is not the real >> world. >> >> That's probably enough character set info to get you into trouble. ;-) > > Mild trouble anyway, no burnings, no heresy trials. The ISO-10646 > standard does favor using the correct BOM at the start of UTF-8, UCS-2 > and UCS-4. Unicode is an extended version of UCS-2 to include pages > other than the 10646 BMP (Basic multilingual plane). Using a BOM with > Unicode may mislead a program reading the file. The problem is not > telling Unicode from UCS-2 when they are different. There no > differences between Unicode and UCS-2 and unless those extra pages are > used. Files in most languages will be identical. Even Japanese and > Chinese may not be detectable--unless you omit the BOM for Unicode > files. ;-) The above is correct if you replace "Unicode" by "UTF-16". UCS-2 uses 2 bytes per character, with no mechanism for representing code points above 65535. UTF-16 is based on UCS-2, with a mechanism for using multiple 2-byte sequences to represent code points above 65535. (In Windows, it's common to refer to Windows-1252 as "ANSI" and UTF-16 as "Unicode". Both are incorrect. Windows-1252 was submitted to ANSI for standardization, but was never approved. UTF-16 is a representation of Unicode.)
I don't know what ISO-10646 recommends, but using a BOM with UTF-8 files causes problems on Unix-like systems. On such systems, most text files these days are UTF-8 and most do not have a BOM (because it's not needed; BOM is a byte order mark, and UTF-8 has no variations in byte ordering). -- Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst> Working, but not speaking, for JetHead Development, Inc. "We must do something. This is something. Therefore, we must do this." -- Antony Jay and Jonathan Lynn, "Yes Minister" ^ permalink raw reply [flat|nested] 26+ messages in thread
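The UCS-2/UTF-16 distinction Keith draws is exactly the surrogate mechanism. A sketch of the split for code points above 16#FFFF#, in standard Ada with no library dependencies:

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Surrogate_Demo is

   type UTF_16_Unit is mod 2**16;

   --  Split a code point above 16#FFFF# into a UTF-16 surrogate pair.
   --  UCS-2 has no such mechanism, which is the difference described above.
   procedure Split (Code : Natural; High, Low : out UTF_16_Unit) is
      V : constant Natural := Code - 16#1_0000#;  -- 20-bit offset
   begin
      High := 16#D800# + UTF_16_Unit (V / 2**10);    -- lead surrogate
      Low  := 16#DC00# + UTF_16_Unit (V mod 2**10);  -- trail surrogate
   end Split;

   High, Low : UTF_16_Unit;
begin
   Split (16#1F600#, High, Low);  -- U+1F600 becomes the pair D83D DE00
   Put_Line (UTF_16_Unit'Image (High) & UTF_16_Unit'Image (Low));
end Surrogate_Demo;
```

The reserved range D800..DFFF mentioned earlier in the thread is precisely the space these two halves come from, which is why no real character may occupy it.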
* Re: unicode and wide_text_io 2017-12-27 22:32 ` Mehdi Saada 2017-12-27 22:33 ` Mehdi Saada 2017-12-27 23:57 ` Randy Brukardt @ 2017-12-28 9:04 ` Dmitry A. Kazakov 2017-12-28 11:06 ` Niklas Holsti 2 siblings, 1 reply; 26+ messages in thread From: Dmitry A. Kazakov @ 2017-12-28 9:04 UTC (permalink / raw) On 2017-12-27 23:32, Mehdi Saada wrote: > Fundamentaly, how can a UTF8 string even represent codepoints next to the 255th ?? UTF-8 uses a chain code to represent large integers. 7-bit ASCII is coded as-is. Other characters require more than one octet. It is a technique widely used in communication for lossless compression. The drawback is that you cannot directly index characters in a UTF-8 string. But virtually no text processing algorithm needs that. So not a loss, actually. In short, representation unit (octet) /= represented thing (character). > Superscripts and subscripts means more change in the IO package. > Before I could simply use the generic Integer_IO, but I have no clue > how to do to output a specific code point for each digit in a > specific base... wouldn't that mean rewriting part of Integer_IO ? You mean the standard library Integer_IO? Sure, you will have to replace it. > I may have a rather very shallow understanding of characters > encoding and representation, and that's quite an understatement, but > you said: "Ada's Character has Latin-1 encoding which differs from > UTF-8 in the code positions greater than 127" > Really ?? Yep. Latin-1 and UTF-8 have different representations. Both have 7-bit ASCII as a subset. > You're sayin' there position such as Wide_Character'Val(X) > doesn't correspond to the Xth character in the UNICODE standard ?? Character = Latin-1 Wide_Character = UCS-2 Wide_Wide_Character = UCS-4 Linux uses UTF-8 (for a long time). Windows uses either ASCII (so-called A-calls) or UTF-16 (so-called W-calls). There was a time, long ago, when Windows used UCS-2, but then they ditched it for UTF-16.
Now, Ada programmers insolently ignore the standard and pragmatically use: Character = representation unit of UTF-8 (octet) Wide_Character = representation unit of UTF-16 Wide_Wide_Character = UNICODE code point This works most of the time, but one should be careful. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 9:04 ` Dmitry A. Kazakov @ 2017-12-28 11:06 ` Niklas Holsti 2017-12-28 11:50 ` Dmitry A. Kazakov 0 siblings, 1 reply; 26+ messages in thread From: Niklas Holsti @ 2017-12-28 11:06 UTC (permalink / raw) On 17-12-28 11:04 , Dmitry A. Kazakov wrote: > On 2017-12-27 23:32, Mehdi Saada wrote: [snip] >> Superscripts and subscripts means more change in the IO package. >> Before I could simply use the generic Integer_IO, but I have no clue >> how to do to output a specific code point for each digit in a >> specific base... wouldn't that mean rewriting part of Integer_IO ? > > You mean the standard library Integer_IO? Sure, you will have to replace > it. It seems simpler to continue using Integer_IO, but to Put the number into a String, and then translate the digits in the resulting String into superscript or subscript form, as desired. The translation for decimal digits 0..9 seems quite simple (https://en.wikipedia.org/wiki/Superscripts_and_Subscripts). Using the Unicode "fraction slash" seems less reliable, to judge from the hints in https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts: "Some browsers support this". -- Niklas Holsti Tidorum Ltd niklas holsti tidorum fi . @ . ^ permalink raw reply [flat|nested] 26+ messages in thread
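Niklas's translate-after-Put approach needs only a small digit mapping. A sketch, assuming an Ada 2012 compiler for Ada.Strings.UTF_Encoding; note the irregular spots, since ¹ ² ³ predate the U+2070 block:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Superscript_Image is

   --  Map a decimal digit to its Unicode superscript counterpart.
   --  '1'..'3' live in Latin-1 Supplement; '0' and '4'..'9' in U+2070..U+2079.
   function Sup (Digit : Character) return Wide_Wide_Character is
   begin
      case Digit is
         when '1' => return Wide_Wide_Character'Val (16#00B9#);
         when '2' => return Wide_Wide_Character'Val (16#00B2#);
         when '3' => return Wide_Wide_Character'Val (16#00B3#);
         when '0' | '4' .. '9' =>
            return Wide_Wide_Character'Val
              (16#2070# + Character'Pos (Digit) - Character'Pos ('0'));
         when others =>  -- pass sign and anything else through unchanged
            return Wide_Wide_Character'Val (Character'Pos (Digit));
      end case;
   end Sup;

   Image  : constant String := "123";  -- as Integer_IO.Put would leave in a String
   Result : Wide_Wide_String (Image'Range);
begin
   for I in Image'Range loop
      Result (I) := Sup (Image (I));
   end loop;
   Ada.Text_IO.Put_Line
     (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Result));
end Superscript_Image;
```

A matching subscript function would use U+2080..U+2089, which is a regular run with no historical exceptions.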
* Re: unicode and wide_text_io 2017-12-28 11:06 ` Niklas Holsti @ 2017-12-28 11:50 ` Dmitry A. Kazakov 0 siblings, 0 replies; 26+ messages in thread From: Dmitry A. Kazakov @ 2017-12-28 11:50 UTC (permalink / raw) On 2017-12-28 12:06, Niklas Holsti wrote: > On 17-12-28 11:04 , Dmitry A. Kazakov wrote: >> On 2017-12-27 23:32, Mehdi Saada wrote: > [snip] >>> Superscripts and subscripts means more change in the IO package. >>> Before I could simply use the generic Integer_IO, but I have no clue >>> how to do to output a specific code point for each digit in a >>> specific base... wouldn't that mean rewriting part of Integer_IO ? >> >> You mean the standard library Integer_IO? Sure, you will have to replace >> it. > > It seems simpler to continue using Integer_IO, but to Put the number > into a String, and then translate the digits in the resulting String > into superscript or subscript form, as desired. Translating an integer into decimal digits is arguably easier than converting ASCII codes for decimal digits (and sign) into UTF-8 subscript and superscript chains of octets. And the procedures and functions for sub-/superscript string I/O are ready. No need to rewrite them. BTW, Integer_IO is quite uncomfortable to use with strings. This was the reason why I redesigned its interface as:

   procedure Put
     (Destination : in out String;
      Pointer     : in out Integer;
      Value       : Number'Base;
      Base        : NumberBase := 10;
      PutPlus     : Boolean    := False;
      Field       : Natural    := 0;
      Justify     : Alignment  := Left;
      Fill        : Character  := ' ');

instead of:

   procedure Put
     (To   : out String;
      Item : in Num;
      Base : in Number_Base := Default_Base);

which requires trimming and thus has little advantage over plain Num'Image. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 18:08 unicode and wide_text_io Mehdi Saada 2017-12-27 20:04 ` Dmitry A. Kazakov 2017-12-27 22:32 ` Mehdi Saada @ 2017-12-28 13:15 ` Mehdi Saada 2017-12-28 14:25 ` Dmitry A. Kazakov 2017-12-28 22:36 ` Mehdi Saada 3 siblings, 1 reply; 26+ messages in thread From: Mehdi Saada @ 2017-12-28 13:15 UTC (permalink / raw) Ok, I'm done with it. It sure is interesting, but I don't want to even think about all this stuff for the time being... Talk about a "universal standard", when it's (apparently) far from universal or uniform ! > Easy: it uses a variable-width representation. Under the assumption that terminals will be able to display it... well, whatever I use in the end, I have to assume that anyway. I'll probably stick with Latin-1 if it doesn't look as nice as intended. If so, I'll forget about the slash or whatnot. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 13:15 ` Mehdi Saada @ 2017-12-28 14:25 ` Dmitry A. Kazakov 2017-12-28 14:32 ` Simon Wright 0 siblings, 1 reply; 26+ messages in thread From: Dmitry A. Kazakov @ 2017-12-28 14:25 UTC (permalink / raw) On 2017-12-28 14:15, Mehdi Saada wrote: > Ok, I'm done with it. It sure is interesting, but I don't want to > even think about all this stuff for the time being... Talk about > "universal standard", when it's (apparently) far from universal or uniform ! It is. Everybody uses UTF-8. Even under Windows. The text is converted from/to UTF-16 right after or before passing it to the system call. All processing is UTF-8. E.g. GTK uses UTF-8 consistently no matter what OS. >> Easy: it uses a variable-width representation. > Under the assumption terminals will be able to display it... well, > whatever I use in the end, I've got to suppose it anyway. Sure they are, on Linux and Windows. Take this program:

------------------------------------
with Ada.Text_IO; use Ada.Text_IO;
procedure Superscript is
begin
   Put_Line
     ( "Superscript 1="
       & Character'Val (194)
       & Character'Val (185)
     );
end Superscript;
------------------------------------

Start a Windows console:

> gnatmake superscript.adb
> chcp 65001
> superscript

This will, depending on the font, nicely output: Superscript 1=¹ P.S. The batch command chcp selects the code page of the console; 65001 is UTF-8. P.P.S. Some Windows fonts do not have sub-/superscript glyphs, so you might wish to set the console to Lucida or some other fixed-width font with Unicode support. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 14:25 ` Dmitry A. Kazakov @ 2017-12-28 14:32 ` Simon Wright 2017-12-28 15:28 ` Niklas Holsti 0 siblings, 1 reply; 26+ messages in thread From: Simon Wright @ 2017-12-28 14:32 UTC (permalink / raw) "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > On 2017-12-28 14:15, Mehdi Saada wrote: > >> Ok, I'm done with it. It sure is interesting, but I don't want to >> even think about all this stuff for the time being... Talk about >> "universal standard", when it's (apparently) it's far from universal >> or uniform ! > > It is. Everybody uses UTF-8. Even under Windows. The text is converted > from/to UTF-16 right after or before passing it to the system > call. All processing is UTF-8. E.g. GTK uses UTF-8 consistently no > matter what OS. > >>> Easy: it uses a variable-width representation. >> Under the assumption terminals will be able to display it... well, >> whatever I use in the end, I've got to suppose it anyway. > > Sure they are Linux and Windows. > > Take this program: > ------------------------------------ > with Ada.Text_IO; use Ada.Text_IO; > procedure Superscript is > begin > Put_Line > ( "Superscript 1=" > & Character'Val (194) > & Character'Val (185) > ); > end Superscript; > ------------------------------------ works fine on macOS (no chcp messing needed!) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 14:32 ` Simon Wright @ 2017-12-28 15:28 ` Niklas Holsti 2017-12-28 15:47 ` 00120260b 2017-12-28 18:15 ` Simon Wright 0 siblings, 2 replies; 26+ messages in thread From: Niklas Holsti @ 2017-12-28 15:28 UTC (permalink / raw) On 17-12-28 16:32 , Simon Wright wrote: > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > >> On 2017-12-28 14:15, Mehdi Saada wrote: >> >>> Ok, I'm done with it. It sure is interesting, but I don't want to >>> even think about all this stuff for the time being... Talk about >>> "universal standard", when it's (apparently) it's far from universal >>> or uniform ! >> >> It is. Everybody uses UTF-8. Even under Windows. The text is converted >> from/to UTF-16 right after or before passing it to the system >> call. All processing is UTF-8. E.g. GTK uses UTF-8 consistently no >> matter what OS. >> >>>> Easy: it uses a variable-width representation. >>> Under the assumption terminals will be able to display it... well, >>> whatever I use in the end, I've got to suppose it anyway. >> >> Sure they are Linux and Windows. >> >> Take this program: >> ------------------------------------ >> with Ada.Text_IO; use Ada.Text_IO; >> procedure Superscript is >> begin >> Put_Line >> ( "Superscript 1=" >> & Character'Val (194) >> & Character'Val (185) >> ); >> end Superscript; >> ------------------------------------ > > works fine on macOS (no chcp messing needed!) Depends on the Preferences (-> Settings -> Advanced: Character encoding) you set for the Mac Terminal program. While UTF-8 is one of the available encodings, I normally have it set to Latin-1, to match Ada and GNAT. Latin-1 is fine for the languages I mostly use (English, Swedish, Finnish). -- Niklas Holsti Tidorum Ltd niklas holsti tidorum fi . @ . ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 15:28 ` Niklas Holsti @ 2017-12-28 15:47 ` 00120260b 2017-12-28 22:35 ` G.B. 2017-12-28 18:15 ` Simon Wright 1 sibling, 1 reply; 26+ messages in thread From: 00120260b @ 2017-12-28 15:47 UTC (permalink / raw) Then, how come the standard hasn't made it a bit easier to input/output post-Latin-1 characters ? Why aren't other standards/character sets/encodings treated more like special cases ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 15:47 ` 00120260b @ 2017-12-28 22:35 ` G.B. 0 siblings, 0 replies; 26+ messages in thread From: G.B. @ 2017-12-28 22:35 UTC (permalink / raw)

On 28.12.17 16:47, 00120260b@gmail.com wrote:
> Then, how come the norm hasn't made it a bit easier to input/ouput post-latin-1 characters ? Why aren't other norms/characters set/encodings more like special cases ?
>

Actually, output of non-7-bit, unambiguously encoded text has been made reasonably easy, I'd say, also defaulting to what should be expected:

with Ada.Wide_Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure UTF is
   --  USD/EUR, i.e. "$/€"
   Ratio : constant Wide_String :=
     "$/" & Wide_Character'Val (16#20AC#);
   use Ada.Wide_Text_Io, Ada.Strings;
begin
   Put_Line (Ratio);  -- use defaults, traditional
   String'Write       -- stream output, force UTF-8
     (Text_Streams.Stream (Current_Output),
      UTF_Encoding.Wide_Strings.Encode (Ratio));
end UTF;

The above source text uses only 7 bit encoding for post-latin-1 strings. Only comment text is using a wide_character. If, instead, source text is encoded by "more" bits, and using post-latin-1 literals or identifiers, then the compiler may need to be told. I think that BOMs may be of use, and in any case, there are compiler switches or some other vendor specific vocabulary describing source text.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 15:28 ` Niklas Holsti 2017-12-28 15:47 ` 00120260b @ 2017-12-28 18:15 ` Simon Wright 1 sibling, 0 replies; 26+ messages in thread From: Simon Wright @ 2017-12-28 18:15 UTC (permalink / raw)

Niklas Holsti <niklas.holsti@tidorum.invalid> writes:

> On 17-12-28 16:32 , Simon Wright wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>> Take this program:
>>> ------------------------------------
>>> with Ada.Text_IO; use Ada.Text_IO;
>>> procedure Superscript is
>>> begin
>>>    Put_Line
>>>      ( "Superscript 1="
>>>      & Character'Val (194)
>>>      & Character'Val (185)
>>>      );
>>> end Superscript;
>>> ------------------------------------
>>
>> works fine on macOS (no chcp messing needed!)
>
> Depends on the Preferences (-> Settings -> Advanced: Character
> encoding) you set for the Mac Terminal program. While UTF-8 is one of
> the available encodings, I normally have it set to Latin-1, to match
> Ada and GNAT. Latin-1 is fine for the languages I mostly use (English,
> Swedish, Finnish).

I dare say you can do something similar under Linux? The setting is in Preferences -> Profiles -> Advanced on High Sierra, and Unicode (UTF-8) is what I have set; I don't recall changing it, so I guess it's the default.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-27 18:08 unicode and wide_text_io Mehdi Saada ` (2 preceding siblings ...) 2017-12-28 13:15 ` Mehdi Saada @ 2017-12-28 22:36 ` Mehdi Saada 2017-12-29 0:51 ` Randy Brukardt 2017-12-30 12:50 ` Björn Lundin 3 siblings, 2 replies; 26+ messages in thread From: Mehdi Saada @ 2017-12-28 22:36 UTC (permalink / raw) I took some time to read here and there on the topics of encoding, character sets, Unicode, what UTF-8, -16 and -32 are, little and big endian, BOM, etc. Now that I've done that, your comments, Dmitry, sound accurate, and it turned out I really knew nothing about [ban]characters[/ban]/glyphs/code points. It wasn't so complicated in the end. I'll look at your work in no time. Since I long to work in the area of interfaces and command-line utilities, the sooner I learn all about characters, the better. Thanks for your explanations, you guys ;-) Myself: > there are positions such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ?? Of course: Character'Val(156) to 'Val(255) are one byte long, whereas in UTF-8 the corresponding code points are encoded with two bytes. Did I understand the lesson? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 22:36 ` Mehdi Saada @ 2017-12-29 0:51 ` Randy Brukardt 2017-12-30 12:50 ` Björn Lundin 1 sibling, 0 replies; 26+ messages in thread From: Randy Brukardt @ 2017-12-29 0:51 UTC (permalink / raw) "Mehdi Saada" <00120260a@gmail.com> wrote in message news:023dc29b-dbc5-4fc8-b44f-d748517adec8@googlegroups.com... ... > Myself: >> there are positions such as Wide_Character'Val(X) doesn't correspond to >> the Xth character in the UNICODE standard ?? > Of course: Character'val(156) to 'val(255) are one byte long, whereas in > UTF8 the corresponding code points are encoded with two bytes. Did I > understood the lesson ? Yup, that's right. And it depends on what the display device is handling as to whether UTF-8 is recognized. If you don't include Dmitry's chcp command on Windows, most likely his program will output garbage. (On my computer, the default code page is 437, which certainly won't display UTF-8 strings!) Randy. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-28 22:36 ` Mehdi Saada 2017-12-29 0:51 ` Randy Brukardt @ 2017-12-30 12:50 ` Björn Lundin 2017-12-30 15:33 ` Dennis Lee Bieber 1 sibling, 1 reply; 26+ messages in thread From: Björn Lundin @ 2017-12-30 12:50 UTC (permalink / raw)

On 2017-12-28 23:36, Mehdi Saada wrote:
> Myself:
>> there are positions such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ??
> Of course: Character'val(156) to 'val(255) are one byte long, whereas in UTF8 the corresponding code points are encoded with two bytes. Did I understood the lesson ?

Yes - if it fits into 2 bytes. If not, UTF-8 uses 3 or 4 bytes instead. So UTF-8 can use code points up to 32 bits (ca 4 billion).

codepoint between
1     -> 2**8  - 1 = 1 byte
2**8  -> 2**16 - 1 = 2 bytes
2**16 -> 2**24 - 1 = 3 bytes
2**24 -> 2**32 - 1 = 4 bytes

-- 
-- Björn

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-30 12:50 ` Björn Lundin @ 2017-12-30 15:33 ` Dennis Lee Bieber 2017-12-30 15:56 ` Dmitry A. Kazakov 2017-12-30 23:20 ` Björn Lundin 0 siblings, 2 replies; 26+ messages in thread From: Dennis Lee Bieber @ 2017-12-30 15:33 UTC (permalink / raw)

On Sat, 30 Dec 2017 13:50:37 +0100, Björn Lundin <b.f.lundin@gmail.com> declaimed the following:

>On 2017-12-28 23:36, Mehdi Saada wrote:
>> Myself:
>>> there are positions such as Wide_Character'Val(X) doesn't correspond to the Xth character in the UNICODE standard ??
>> Of course: Character'val(156) to 'val(255) are one byte long, whereas in UTF8 the corresponding code points are encoded with two bytes. Did I understood the lesson ?
>
>Yes - if it fits into 2 bytes. if not UTF-8 uses 3 and 4 bytes instead.
>So UTF-8 can use codepoints up to 32 bits (ca 4 billion)
>
>codepoint between
>1 -> 2**8 -1 = 1 byte

	Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and number of MSB bits set before a 0 bit indicates how many bytes).

>2**8 -> 2**16 -1 = 2 bytes
>2**16 -> 2**24 -1 = 3 bytes
>2**24 -> 2**32 -1 = 4 bytes

-- 
Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: unicode and wide_text_io 2017-12-30 15:33 ` Dennis Lee Bieber @ 2017-12-30 15:56 ` Dmitry A. Kazakov 2017-12-30 23:20 ` Björn Lundin 1 sibling, 0 replies; 26+ messages in thread From: Dmitry A. Kazakov @ 2017-12-30 15:56 UTC (permalink / raw) On 2017-12-30 16:33, Dennis Lee Bieber wrote: > Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and > number of MSB bits set before a 0 bit indicates how many bytes). Yes. Furthermore, the subsequent octets have MSB set. The reason for this "waste" is to allow bidirectional scanning of UTF-8 strings. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 26+ messages in thread
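[Editor's note, not part of the thread: the lead-byte rule Dennis and Dmitry describe, the count of leading 1 bits in the first byte gives the sequence length, and every continuation byte matches 10xxxxxx, can be checked mechanically. A Python sketch; note that the actual length boundaries fall at 2**7, 2**11, and 2**16 code points, not at the whole-byte boundaries of the earlier table:]

```python
# UTF-8 sequence length as a function of code point, checked empirically.
# Boundaries: U+0000..U+007F -> 1 byte, U+0080..U+07FF -> 2 bytes,
# U+0800..U+FFFF -> 3 bytes, U+10000..U+10FFFF -> 4 bytes.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    enc = chr(cp).encode("utf-8")

    # Count the leading 1 bits of the lead byte (0 means 1-byte sequence).
    ones = 0
    while enc[0] & (0x80 >> ones):
        ones += 1
    assert len(enc) == (1 if ones == 0 else ones)

    # All continuation bytes match the 10xxxxxx pattern.
    assert all(b & 0xC0 == 0x80 for b in enc[1:])

    print(f"U+{cp:06X}: {len(enc)} bytes, lead byte {enc[0]:08b}")
```

[Python's encoder enforces the current Unicode cap of U+10FFFF, so the 4-byte row tops out there rather than at 2**32 - 1.]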
* Re: unicode and wide_text_io 2017-12-30 15:33 ` Dennis Lee Bieber 2017-12-30 15:56 ` Dmitry A. Kazakov @ 2017-12-30 23:20 ` Björn Lundin 1 sibling, 0 replies; 26+ messages in thread From: Björn Lundin @ 2017-12-30 23:20 UTC (permalink / raw)

On 2017-12-30 16:33, Dennis Lee Bieber wrote:
>> codepoint between
>> 1 -> 2**8 -1 = 1 byte
> Isn't that 0..2^7... Any byte with the MSB set is a multibyte code (and
> number of MSB bits set before a 0 bit indicates how many bytes).
>
>> 2**8 -> 2**16 -1 = 2 bytes
>> 2**16 -> 2**24 -1 = 3 bytes
>> 2**24 -> 2**32 -1 = 4 bytes

You are probably right; I meant to point out the principle: that UTF-8 can be more than 2 bytes, and that it expands as needed up to 4 bytes.

-- 
-- Björn

^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2017-12-31 21:41 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
2017-12-27 20:04 ` Dmitry A. Kazakov
2017-12-27 21:47 ` Dennis Lee Bieber
2017-12-27 22:32 ` Mehdi Saada
2017-12-27 22:33 ` Mehdi Saada
2017-12-27 22:48 ` Mehdi Saada
2017-12-27 23:32 ` Mehdi Saada
2017-12-27 23:57 ` Randy Brukardt
2017-12-28  5:20 ` Robert Eachus
2017-12-31 21:41 ` Keith Thompson
2017-12-28  9:04 ` Dmitry A. Kazakov
2017-12-28 11:06 ` Niklas Holsti
2017-12-28 11:50 ` Dmitry A. Kazakov
2017-12-28 13:15 ` Mehdi Saada
2017-12-28 14:25 ` Dmitry A. Kazakov
2017-12-28 14:32 ` Simon Wright
2017-12-28 15:28 ` Niklas Holsti
2017-12-28 15:47 ` 00120260b
2017-12-28 22:35 ` G.B.
2017-12-28 18:15 ` Simon Wright
2017-12-28 22:36 ` Mehdi Saada
2017-12-29  0:51 ` Randy Brukardt
2017-12-30 12:50 ` Björn Lundin
2017-12-30 15:33 ` Dennis Lee Bieber
2017-12-30 15:56 ` Dmitry A. Kazakov
2017-12-30 23:20 ` Björn Lundin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox