* Ada and Unicode @ 2021-04-17 22:03 DrPi 2021-04-18 0:02 ` Luke A. Guest ` (4 more replies) 0 siblings, 5 replies; 63+ messages in thread From: DrPi @ 2021-04-17 22:03 UTC (permalink / raw) Hi, I have a good knowledge of Unicode: code points, encoding... What I don't understand is how to manage Unicode strings with Ada. I've read part of the ARM and did some tests without success. I managed to be partly successful with source code encoded in Latin-1. Any other encoding failed. Any way to use source code encoded in UTF-8? In some languages, it is possible to set a tag at the beginning of the source file to tell the compiler which encoding to use. I wasn't successful using the -gnatW8 switch, but maybe I made too many tests and my brain was scrambled. Even with source code encoded in Latin-1, I've not been able to manage Unicode strings correctly. What's the way to manage Unicode correctly? Regards, Nicolas ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-17 22:03 Ada and Unicode DrPi @ 2021-04-18 0:02 ` Luke A. Guest 2021-04-19 9:09 ` DrPi 2021-04-19 8:29 ` Maxim Reznik ` (3 subsequent siblings) 4 siblings, 1 reply; 63+ messages in thread From: Luke A. Guest @ 2021-04-18 0:02 UTC (permalink / raw) On 17/04/2021 23:03, DrPi wrote: > Hi, > > I have a good knowledge of Unicode: code points, encoding... > What I don't understand is how to manage Unicode strings with Ada. I've > read part of the ARM and did some tests without success. It's a mess imo. I've complained about it before. The official stance is that the standard defines that a compiler should accept the ISO equivalent of Unicode, and that a compiler should implement a flawed system, especially the UTF-8 types: http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html Unicode is a bit painful; I've messed about with it to some degree here: https://github.com/Lucretia/uca. There are other attempts: 1. http://www.dmitry-kazakov.de/ada/strings_edit.htm 2. https://github.com/reznikmm/matreshka (very heavy, many layers) 3. https://github.com/Blady-Com/UXStrings I remember getting an exception converting from my Unicode_String to a Wide_Wide_String for some reason ages ago. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-18 0:02 ` Luke A. Guest @ 2021-04-19 9:09 ` DrPi 0 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 9:09 UTC (permalink / raw) On 18/04/2021 02:02, Luke A. Guest wrote: > > On 17/04/2021 23:03, DrPi wrote: >> Hi, >> >> I have a good knowledge of Unicode: code points, encoding... >> What I don't understand is how to manage Unicode strings with Ada. >> I've read part of the ARM and did some tests without success. > > It's a mess imo. I've complained about it before. The official stance is > that the standard defines that a compiler should accept the ISO > equivalent of Unicode, and that a compiler should implement a flawed > system, especially the UTF-8 types: > http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-A-4-11.html > > Unicode is a bit painful; I've messed about with it to some degree here: > https://github.com/Lucretia/uca. > > There are other attempts: > > 1. http://www.dmitry-kazakov.de/ada/strings_edit.htm > 2. https://github.com/reznikmm/matreshka (very heavy, many layers) > 3. https://github.com/Blady-Com/UXStrings > > I remember getting an exception converting from my Unicode_String to a > Wide_Wide_String for some reason ages ago. Thanks ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-17 22:03 Ada and Unicode DrPi 2021-04-18 0:02 ` Luke A. Guest @ 2021-04-19 8:29 ` Maxim Reznik 2021-04-19 9:28 ` DrPi 2021-04-19 11:15 ` Simon Wright 2021-04-19 9:08 ` Stephen Leake ` (2 subsequent siblings) 4 siblings, 2 replies; 63+ messages in thread From: Maxim Reznik @ 2021-04-19 8:29 UTC (permalink / raw) Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: > > Any way to use source code encoded in UTF-8? Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line or in your project file): -- main.adb: with Ada.Wide_Wide_Text_IO; procedure Main is Привет : constant Wide_Wide_String := "Привет"; begin Ada.Wide_Wide_Text_IO.Put_Line (Привет); end Main; $ gprbuild -gnatW8 main.adb $ ./main Привет > In some languages, it is possible to set a tag at the beginning of the > source file to tell the compiler which encoding to use. You can do this by putting the Wide_Character_Encoding pragma (a GNAT-specific pragma) at the top of the file. Take a look: -- main.adb: pragma Wide_Character_Encoding (UTF8); with Ada.Wide_Wide_Text_IO; procedure Main is Привет : constant Wide_Wide_String := "Привет"; begin Ada.Wide_Wide_Text_IO.Put_Line (Привет); end Main; $ gprbuild main.adb $ ./main Привет > What's the way to manage Unicode correctly? > You can use the Wide_Wide_String and Unbounded_Wide_Wide_String types to process Unicode strings, but this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexp, string vectors, XML, JSON, databases, Web Servlets, template engine, etc.). URL: https://forge.ada-ru.org/matreshka > Regards, > Nicolas ^ permalink raw reply [flat|nested] 63+ messages in thread
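A minimal sketch of the Unbounded_Wide_Wide_String route mentioned above (the unit name is an invention; the packages are standard Ada):

   -- unbounded_demo.adb: growing a Unicode string dynamically
   -- (compile with -gnatW8 so the literals below are read as UTF-8)
   with Ada.Strings.Wide_Wide_Unbounded; use Ada.Strings.Wide_Wide_Unbounded;
   with Ada.Wide_Wide_Text_IO;

   procedure Unbounded_Demo is
      S : Unbounded_Wide_Wide_String := To_Unbounded_Wide_Wide_String ("Привет");
   begin
      Append (S, ", мир");  -- no fixed length to declare up front
      Ada.Wide_Wide_Text_IO.Put_Line (To_Wide_Wide_String (S));
   end Unbounded_Demo;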
* Re: Ada and Unicode 2021-04-19 8:29 ` Maxim Reznik @ 2021-04-19 9:28 ` DrPi 2021-04-19 13:50 ` Maxim Reznik 2021-04-19 11:15 ` Simon Wright 1 sibling, 1 reply; 63+ messages in thread From: DrPi @ 2021-04-19 9:28 UTC (permalink / raw) On 19/04/2021 10:29, Maxim Reznik wrote: > Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: >> >> Any way to use source code encoded in UTF-8? > > Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line or in your project file): > > -- main.adb: > with Ada.Wide_Wide_Text_IO; > > procedure Main is > Привет : constant Wide_Wide_String := "Привет"; > begin > Ada.Wide_Wide_Text_IO.Put_Line (Привет); > end Main; > > $ gprbuild -gnatW8 main.adb > $ ./main > Привет > > >> In some languages, it is possible to set a tag at the beginning of the >> source file to tell the compiler which encoding to use. > > You can do this by putting the Wide_Character_Encoding pragma (a GNAT-specific pragma) at the top of the file. Take a look: > > -- main.adb: > pragma Wide_Character_Encoding (UTF8); > > with Ada.Wide_Wide_Text_IO; > > procedure Main is > Привет : constant Wide_Wide_String := "Привет"; > begin > Ada.Wide_Wide_Text_IO.Put_Line (Привет); > end Main; > > $ gprbuild main.adb > $ ./main > Привет > Wide and Wide_Wide characters and UTF-8 are two distinct things. Wide and Wide_Wide characters are supposed to contain Unicode code points (Unicode characters). UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters. What's the purpose of "pragma Wide_Character_Encoding (UTF8);"? > > >> What's the way to manage Unicode correctly? >> > > You can use the Wide_Wide_String and Unbounded_Wide_Wide_String types to process Unicode strings, but this is not very handy. I use the Matreshka library for Unicode strings. It has a lot of features (regexp, string vectors, XML, JSON, databases, Web Servlets, template engine, etc.). URL: https://forge.ada-ru.org/matreshka Thanks > >> Regards, >> Nicolas ^ permalink raw reply [flat|nested] 63+ messages in thread
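The distinction drawn here is exactly what the standard Ada 2012 conversions in Ada.Strings.UTF_Encoding (RM A.4.11) mediate; a minimal sketch, with the unit name invented:

   -- encoding_demo.adb: Wide_Wide_String holds code points; UTF_8_String
   -- holds their encoding as a sequence of octets (Ada 2012, RM A.4.11)
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   use  Ada.Strings.UTF_Encoding;
   with Ada.Text_IO;

   procedure Encoding_Demo is
      Points : constant Wide_Wide_String :=
        (1 => Wide_Wide_Character'Val (16#E9#));   -- 'é', one code point
      Octets : constant UTF_8_String :=
        Wide_Wide_Strings.Encode (Points);         -- two octets: 16#C3# 16#A9#
   begin
      Ada.Text_IO.Put_Line ("code points:" & Integer'Image (Points'Length));  -- 1
      Ada.Text_IO.Put_Line ("UTF-8 octets:" & Integer'Image (Octets'Length)); -- 2
      pragma Assert (Wide_Wide_Strings.Decode (Octets) = Points);  -- round trip
   end Encoding_Demo;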
* Re: Ada and Unicode 2021-04-19 9:28 ` DrPi @ 2021-04-19 13:50 ` Maxim Reznik 2021-04-19 15:51 ` DrPi 0 siblings, 1 reply; 63+ messages in thread From: Maxim Reznik @ 2021-04-19 13:50 UTC (permalink / raw) Monday, April 19, 2021 at 12:28:39 UTC+3, DrPi: > On 19/04/2021 10:29, Maxim Reznik wrote: > > Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: > >> In some languages, it is possible to set a tag at the beginning of the > >> source file to tell the compiler which encoding to use. > > > > You can do this by putting the Wide_Character_Encoding pragma (a GNAT-specific pragma) at the top of the file. > > > Wide and Wide_Wide characters and UTF-8 are two distinct things. > Wide and Wide_Wide characters are supposed to contain Unicode code > points (Unicode characters). > UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters. Yes, it is. > What's the purpose of "pragma Wide_Character_Encoding (UTF8);"? This pragma specifies the character encoding to be used in program source text... https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding I would also suggest this article to read: https://two-wrongs.com/unicode-strings-in-ada-2012 Best regards, ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:50 ` Maxim Reznik @ 2021-04-19 15:51 ` DrPi 0 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 15:51 UTC (permalink / raw) On 19/04/2021 15:50, Maxim Reznik wrote: > Monday, April 19, 2021 at 12:28:39 UTC+3, DrPi: >> On 19/04/2021 10:29, Maxim Reznik wrote: >>> Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: >>>> In some languages, it is possible to set a tag at the beginning of the >>>> source file to tell the compiler which encoding to use. >>> >>> You can do this by putting the Wide_Character_Encoding pragma (a GNAT-specific pragma) at the top of the file. >>> >> Wide and Wide_Wide characters and UTF-8 are two distinct things. >> Wide and Wide_Wide characters are supposed to contain Unicode code >> points (Unicode characters). >> UTF-8 is a stream of bytes, the encoding of Wide or Wide_Wide characters. > > Yes, it is. > >> What's the purpose of "pragma Wide_Character_Encoding (UTF8);"? > > This pragma specifies the character encoding to be used in program source text... > > https://docs.adacore.com/gnat_rm-docs/html/gnat_rm/gnat_rm/implementation_defined_pragmas.html#pragma-wide-character-encoding Good to know. > > I would also suggest this article to read: > > https://two-wrongs.com/unicode-strings-in-ada-2012 > I think I've already read it. But will do again. > Best regards, > Thanks ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 8:29 ` Maxim Reznik 2021-04-19 9:28 ` DrPi @ 2021-04-19 11:15 ` Simon Wright 2021-04-19 11:50 ` Luke A. Guest ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Simon Wright @ 2021-04-19 11:15 UTC (permalink / raw) Maxim Reznik <reznikmm@gmail.com> writes: > Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: >> >> Any way to use source code encoded in UTF-8? > > Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line > or in your project file): But don't use unit names containing international characters, at any rate if you're (interested in compiling on) Windows or macOS: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 11:15 ` Simon Wright @ 2021-04-19 11:50 ` Luke A. Guest 2021-04-19 15:53 ` DrPi 2022-04-03 19:20 ` Thomas 2 siblings, 0 replies; 63+ messages in thread From: Luke A. Guest @ 2021-04-19 11:50 UTC (permalink / raw) On 19/04/2021 12:15, Simon Wright wrote: > Maxim Reznik <reznikmm@gmail.com> writes: > >> Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: >>> >>> Any way to use source code encoded in UTF-8? >> >> Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line >> or in your project file): > > But don't use unit names containing international characters, at any > rate if you're (interested in compiling on) Windows or macOS: There's no such thing as "character" any more and we need to move away from that. Unicode has the concept of a code point, which is a 32-bit number, and any "character" as we know it, or glyph, can consist of multiple code points. In my lib, nowhere near ready (whether it ever will be I don't know), I define octets, Unicode_String (a UTF-8 string) which is an array of octets, and Code_Points which an iterator produces as it iterates over those strings. I was intending to have an iterator for grapheme clusters and other units. ^ permalink raw reply [flat|nested] 63+ messages in thread
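A small illustration of the multiple-code-point point, using only standard packages (the unit name is invented; run with -gnata to enable the assertions):

   -- grapheme_demo.adb: one user-perceived character ("é"), two code
   -- points (U+0065, U+0301), three UTF-8 octets
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   use  Ada.Strings.UTF_Encoding;

   procedure Grapheme_Demo is
      E_Acute : constant Wide_Wide_String :=
        'e' & Wide_Wide_Character'Val (16#0301#);  -- COMBINING ACUTE ACCENT
   begin
      pragma Assert (E_Acute'Length = 2);                            -- code points
      pragma Assert (Wide_Wide_Strings.Encode (E_Acute)'Length = 3); -- octets
   end Grapheme_Demo;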
* Re: Ada and Unicode 2021-04-19 11:15 ` Simon Wright 2021-04-19 11:50 ` Luke A. Guest @ 2021-04-19 15:53 ` DrPi 2022-04-03 19:20 ` Thomas 2 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 15:53 UTC (permalink / raw) On 19/04/2021 13:15, Simon Wright wrote: > Maxim Reznik <reznikmm@gmail.com> writes: > >> Sunday, April 18, 2021 at 01:03:14 UTC+3, DrPi: >>> >>> Any way to use source code encoded in UTF-8? >> >> Yes, with GNAT just use the "-gnatW8" compiler flag (on the command line >> or in your project file): > > But don't use unit names containing international characters, at any > rate if you're (interested in compiling on) Windows or macOS: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 > Good to know. Thanks ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 11:15 ` Simon Wright 2021-04-19 11:50 ` Luke A. Guest 2021-04-19 15:53 ` DrPi @ 2022-04-03 19:20 ` Thomas 2022-04-04 6:10 ` Vadim Godunko 2022-04-04 14:33 ` Simon Wright 2 siblings, 2 replies; 63+ messages in thread From: Thomas @ 2022-04-03 19:20 UTC (permalink / raw) In article <lyfszm5xv2.fsf@pushface.org>, Simon Wright <simon@pushface.org> wrote: > But don't use unit names containing international characters, at any > rate if you're (interested in compiling on) Windows or macOS: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 If I understand correctly, Eric Botcazou is a GNU admin who decided to reject your bug? I find his thinking very "low portability"! It is the responsibility of compilers and other underlying tools to manage the various underlying OSs and FSs; it is not up to the user to avoid those that the compiler devs find too bad! (Or to use the right encoding. I heard that Windows uses UTF-16; do you know about that?) Clearly, To_Lower takes Latin-1. And this kind of problem would be easier to avoid if string types were stronger ... After: package Ada.Strings.UTF_Encoding ... type UTF_8_String is new String; ... end Ada.Strings.UTF_Encoding; I would have also made: package Ada.Directories ... type File_Name_String is new Ada.Strings.UTF_Encoding.UTF_8_String; ... end Ada.Directories; with probably a validity check and a Dynamic_Predicate which allows "". Then I would use File_Name_String in all of Ada.Directories and Ada.*_IO. -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-03 19:20 ` Thomas @ 2022-04-04 6:10 ` Vadim Godunko 2022-04-04 14:19 ` Simon Wright 2023-03-30 23:35 ` Thomas 1 sibling, 2 replies; 63+ messages in thread From: Vadim Godunko @ 2022-04-04 6:10 UTC (permalink / raw) On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote: > > > But don't use unit names containing international characters, at any > > rate if you're (interested in compiling on) Windows or macOS: > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 > > and this kind of problem would be easier to avoid if string types were stronger ... > Your suggestion is unable to resolve this issue on Mac OS X. As with case sensitivity, a binary compare of two strings can't match strings in different normalization forms. The right solution is to use the right type to represent any paths, and even that doesn't resolve some issues, like relative paths and the change of rules at mounting points. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-04 6:10 ` Vadim Godunko @ 2022-04-04 14:19 ` Simon Wright 2022-04-04 15:11 ` Simon Wright 2022-04-05 7:59 ` Vadim Godunko 1 sibling, 2 replies; 63+ messages in thread From: Simon Wright @ 2022-04-04 14:19 UTC (permalink / raw) Vadim Godunko <vgodunko@gmail.com> writes: > On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote: >> >> > But don't use unit names containing international characters, at >> > any rate if you're (interested in compiling on) Windows or macOS: >> > >> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 >> >> and this kind of problem would be easier to avoid if string types >> were stronger ... >> > > Your suggestion is unable to resolve this issue on Mac OS X. As with case > sensitivity, a binary compare of two strings can't match strings in > different normalization forms. The right solution is to use the right type to > represent any paths, and even that doesn't resolve some issues, like > relative paths and the change of rules at mounting points. I think that's a macOS problem that Apple aren't going to resolve* any time soon! While banging my head against PR81114 recently, I found (can't remember where) that (lower case a acute) and (lower case a, combining acute) represent the same concept and it's up to tools/operating systems etc to recognise that. Emacs, too, has a problem: it doesn't recognise the 'combining' part of (lower case a, combining acute), so what you see on your screen is "a'". * I don't know how/whether clang addresses this. ^ permalink raw reply [flat|nested] 63+ messages in thread
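The two spellings Simon describes, written out as explicit code points (a sketch; the unit name is invented):

   -- nfd_demo.adb: U+00E1 (precomposed, NFC) versus U+0061 U+0301
   -- (decomposed, NFD): the same concept, yet a binary compare of the raw
   -- code-point sequences fails, which is the macOS file-name problem
   procedure NFD_Demo is
      Precomposed : constant Wide_Wide_String :=
        (1 => Wide_Wide_Character'Val (16#E1#));
      Decomposed  : constant Wide_Wide_String :=
        'a' & Wide_Wide_Character'Val (16#0301#);
   begin
      pragma Assert (Precomposed /= Decomposed);  -- no normalization happens
   end NFD_Demo;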
* Re: Ada and Unicode 2022-04-04 14:19 ` Simon Wright @ 2022-04-04 15:11 ` Simon Wright 2022-04-05 7:59 ` Vadim Godunko 1 sibling, 0 replies; 63+ messages in thread From: Simon Wright @ 2022-04-04 15:11 UTC (permalink / raw) Simon Wright <simon@pushface.org> writes: > I think that's a macOS problem that Apple aren't going to resolve* any > time soon! While banging my head against PR81114 recently, I found > (can't remember where) that (lower case a acute) and (lower case a, > combining acute) represent the same concept and it's up to > tools/operating systems etc to recognise that. [...] > * I don't know how/whether clang addresses this. It doesn't, so far as I can tell; has the exact same problem. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-04 14:19 ` Simon Wright 2022-04-04 15:11 ` Simon Wright @ 2022-04-05 7:59 ` Vadim Godunko 2022-04-08 9:01 ` Simon Wright 1 sibling, 1 reply; 63+ messages in thread From: Vadim Godunko @ 2022-04-05 7:59 UTC (permalink / raw) On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote: > I think that's a macOS problem that Apple aren't going to resolve* any > time soon! While banging my head against PR81114 recently, I found > (can't remember where) that (lower case a acute) and (lower case a, > combining acute) represent the same concept and it's up to > tools/operating systems etc to recognise that. > And will not. It is application responsibility to convert file names to NFD to pass to OS. Also, application must compare any paths after conversion to NFD, it is important to handle more complicated cases when canonical reordering is applied. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-05 7:59 ` Vadim Godunko @ 2022-04-08 9:01 ` Simon Wright 0 siblings, 0 replies; 63+ messages in thread From: Simon Wright @ 2022-04-08 9:01 UTC (permalink / raw) Vadim Godunko <vgodunko@gmail.com> writes: > On Monday, April 4, 2022 at 5:19:20 PM UTC+3, Simon Wright wrote: >> I think that's a macOS problem that Apple aren't going to resolve* any >> time soon! While banging my head against PR81114 recently, I found >> (can't remember where) that (lower case a acute) and (lower case a, >> combining acute) represent the same concept and it's up to >> tools/operating systems etc to recognise that. >> > And will not. It is application responsibility to convert file names > to NFD to pass to OS. Also, application must compare any paths after > conversion to NFD, it is important to handle more complicated cases > when canonical reordering is applied. Isn't the compiler a tool? gnatmake? gprbuild? (gnatmake handles ACATS c250002 provided you tell the compiler that the fs is case-sensitive, gprbuild doesn't even manage that) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-04 6:10 ` Vadim Godunko 2022-04-04 14:19 ` Simon Wright @ 2023-03-30 23:35 ` Thomas 1 sibling, 0 replies; 63+ messages in thread From: Thomas @ 2023-03-30 23:35 UTC (permalink / raw) Sorry for the delay. In article <48309745-aa2a-47bd-a4f9-6daa843e0771n@googlegroups.com>, Vadim Godunko <vgodunko@gmail.com> wrote: > On Sunday, April 3, 2022 at 10:20:21 PM UTC+3, Thomas wrote: > > > > > But don't use unit names containing international characters, at any > > > rate if you're (interested in compiling on) Windows or macOS: > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 > > > > and this kind of problem would be easier to avoid if string types were > > stronger ... > > > > Your suggestion is unable to resolve this issue on Mac OS X. I said "easier", not "easy". Don't forget that Unicode has 2 levels: octets <-> code points, and code points <-> characters/glyphs. You can't expect the upper level to work if the lower one doesn't. > As with case > sensitivity, a binary compare of two strings can't match strings in different > normalization forms. The right solution is to use the right type to represent any > paths, What would be the "right type", according to you? In fact, the first question to ask here is: what's the expected encoding for Ada.Text_IO.Open.Name? Is it Latin-1, because the type is String, not UTF_8_String? Or is it undefined, because it depends on the underlying FS? -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-03 19:20 ` Thomas 2022-04-04 6:10 ` Vadim Godunko @ 2022-04-04 14:33 ` Simon Wright 1 sibling, 0 replies; 63+ messages in thread From: Simon Wright @ 2022-04-04 14:33 UTC (permalink / raw) Thomas <fantome.forums.tDeContes@free.fr.invalid> writes: > In article <lyfszm5xv2.fsf@pushface.org>, > Simon Wright <simon@pushface.org> wrote: > >> But don't use unit names containing international characters, at any >> rate if you're (interested in compiling on) Windows or macOS: >> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81114 > > If I understand correctly, Eric Botcazou is a GNU admin who decided to reject > your bug? I find his thinking very "low portability"! To be fair, he only suspended it - you can tell I didn't want to press very far. We could remove the part where the filename is smashed to lower-case as if it were ASCII[1][2][3] (OK, perhaps Latin-1?) if the machine is Windows or (Apple if not on aarch64!!!), but that still leaves the filesystem name issue. Windows might be OK (code pages???) [1] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/adaint.c#L620 [2] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L812 [3] https://github.com/gcc-mirror/gcc/blob/master/gcc/ada/lib-writ.adb#L1490 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-17 22:03 Ada and Unicode DrPi 2021-04-18 0:02 ` Luke A. Guest 2021-04-19 8:29 ` Maxim Reznik @ 2021-04-19 9:08 ` Stephen Leake 2021-04-19 9:34 ` Dmitry A. Kazakov ` (3 more replies) 2021-04-19 13:18 ` Vadim Godunko 2021-04-19 22:40 ` Shark8 4 siblings, 4 replies; 63+ messages in thread From: Stephen Leake @ 2021-04-19 9:08 UTC (permalink / raw) DrPi <314@drpi.fr> writes: > Any way to use source code encoded in UTF-8? for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8"); from the GNAT user guide, 4.3.1 Alphabetical List of All Switches: -gnatic: Identifier character set (c = 1/2/3/4/8/9/p/f/n/w); for details of the possible selections for c, see the 'Character Set Control' section. This applies to identifiers in the source code. -gnatWe: Wide character encoding method (e = n/h/u/s/e/8). This applies to string and character literals. > What's the way to manage Unicode correctly? There are two issues: Unicode in source code, that the compiler must understand, and Unicode in strings, that your program must understand. (I've never written a program that dealt with utf strings other than file names). -gnati8 tells the compiler that the source code uses utf-8 encoding. -gnatW8 tells the compiler that string literals use utf-8 encoding. package Ada.Strings.UTF_Encoding provides some facilities for dealing with UTF. It does _not_ provide walking a string by code point, which would seem necessary. We could be more helpful if you show what you are trying to do, what you've tried, and what errors you got. -- -- Stephe ^ permalink raw reply [flat|nested] 63+ messages in thread
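Since the standard library does not provide walking by code point, a hand-rolled walker over a UTF_8_String might look like the following sketch (it assumes well-formed UTF-8 and performs no validation; all names are invented):

   -- walk_utf8.adb: iterating the code points of a UTF_8_String by hand
   with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
   with Ada.Integer_Text_IO;

   procedure Walk_UTF8 is
      -- "éà" as explicit UTF-8 octets, to stay independent of source encoding
      S : constant UTF_8_String :=
        Character'Val (16#C3#) & Character'Val (16#A9#) &   -- 'é'
        Character'Val (16#C3#) & Character'Val (16#A0#);    -- 'à'
      I : Natural := S'First;

      function Sequence_Length (Lead : Character) return Positive is
         O : constant Natural := Character'Pos (Lead);
      begin
         if    O < 16#80# then return 1;  -- 0xxxxxxx
         elsif O < 16#E0# then return 2;  -- 110xxxxx
         elsif O < 16#F0# then return 3;  -- 1110xxxx
         else                  return 4;  -- 11110xxx
         end if;
      end Sequence_Length;

      function Code_Point (From, Length : Positive) return Natural is
         Mask : constant array (1 .. 4) of Natural :=
           (16#80#, 16#20#, 16#10#, 16#08#);  -- payload bits of the lead octet
         CP : Natural := Character'Pos (S (From)) mod Mask (Length);
      begin
         for J in From + 1 .. From + Length - 1 loop
            CP := CP * 64 + Character'Pos (S (J)) mod 16#40#;  -- 10xxxxxx
         end loop;
         return CP;
      end Code_Point;

   begin
      while I <= S'Last loop
         declare
            L : constant Positive := Sequence_Length (S (I));
         begin
            Ada.Integer_Text_IO.Put (Code_Point (I, L), Width => 9, Base => 16);
            I := I + L;  -- prints 16#E9# 16#E0#
         end;
      end loop;
   end Walk_UTF8;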
* Re: Ada and Unicode 2021-04-19 9:08 ` Stephen Leake @ 2021-04-19 9:34 ` Dmitry A. Kazakov 0 siblings, 0 replies; 63+ messages in thread From: Dmitry A. Kazakov @ 2021-04-19 9:34 UTC (permalink / raw) On 2021-04-19 11:08, Stephen Leake wrote: > (I've never written a program that dealt with utf strings other than > file names). > > -gnati8 tells the compiler that the source code uses utf-8 encoding. > > -gnatW8 tells the compiler that string literals use utf-8 encoding. Both are recipes for disaster, especially the second. IMO the source must be strictly ASCII 7-bit. It is less dangerous to have UTF-8 or Latin-1 identifiers; they can at least be checked, except when used for external names. But string literals would be a ticking bomb. If you need a wider set than ASCII, use named constants and integer literals. E.g. Celsius : constant String := Character'Val (16#C2#) & Character'Val (16#B0#) & 'C'; > We could be more helpful if you show what you are trying to do, what you've > tried, and what errors you got. True -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 9:08 ` Stephen Leake 2021-04-19 9:34 ` Dmitry A. Kazakov @ 2021-04-19 11:56 ` Luke A. Guest 2021-04-19 12:13 ` Luke A. Guest 2021-04-19 12:52 ` Dmitry A. Kazakov 2021-04-19 16:14 ` DrPi 2022-04-16 2:32 ` Thomas 3 siblings, 2 replies; 63+ messages in thread From: Luke A. Guest @ 2021-04-19 11:56 UTC (permalink / raw) On 19/04/2021 10:08, Stephen Leake wrote: >> What's the way to manage Unicode correctly? > > There are two issues: Unicode in source code, that the compiler must > understand, and Unicode in strings, that your program must understand. And this is where the Ada standard gets it wrong, in the encodings package re UTF-8. Unicode is a superset of 7-bit ASCII, not Latin-1. The high bit in the leading octet indicates whether there are trailing octets. See https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data layout. The first 128 "characters" in Unicode match those of 7-bit ASCII, not 8-bit ASCII, and certainly not Latin-1. Therefore this: package Ada.Strings.UTF_Encoding ... subtype UTF_8_String is String; ... end Ada.Strings.UTF_Encoding; Was absolutely and totally wrong. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 11:56 ` Luke A. Guest @ 2021-04-19 12:13 ` Luke A. Guest 2021-04-19 15:48 ` DrPi 0 siblings, 1 reply; 63+ messages in thread From: Luke A. Guest @ 2021-04-19 12:13 UTC (permalink / raw) On 19/04/2021 12:56, Luke A. Guest wrote: > > package Ada.Strings.UTF_Encoding > ... > subtype UTF_8_String is String; > ... > end Ada.Strings.UTF_Encoding; > > Was absolutely and totally wrong. ...and, before someone comes back with "but all of the upper half of Latin-1 is represented and has the same values": yes, they do have the same values, as code points. In UTF-8 they are encoded as 2 octets! ^ permalink raw reply [flat|nested] 63+ messages in thread
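Concretely, a sketch (names invented; run with -gnata to enable the assertions): the code point U+00E9 ('é') occupies one octet in Latin-1 but two in UTF-8, and a plain String accepts both, indistinguishably:

   -- octets_demo.adb: the same code point, two encodings, one type
   procedure Octets_Demo is
      In_Latin_1 : constant String := (1 => Character'Val (16#E9#));
      In_UTF_8   : constant String :=
        Character'Val (16#C3#) & Character'Val (16#A9#);
   begin
      pragma Assert (In_Latin_1'Length = 1);
      pragma Assert (In_UTF_8'Length = 2);
      pragma Assert (In_Latin_1 /= In_UTF_8);  -- same "character", unequal Strings
   end Octets_Demo;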
* Re: Ada and Unicode 2021-04-19 12:13 ` Luke A. Guest @ 2021-04-19 15:48 ` DrPi 0 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 15:48 UTC (permalink / raw) On 19/04/2021 14:13, Luke A. Guest wrote: > > On 19/04/2021 12:56, Luke A. Guest wrote: > >> >> package Ada.Strings.UTF_Encoding >> ... >> subtype UTF_8_String is String; >> ... >> end Ada.Strings.UTF_Encoding; >> >> Was absolutely and totally wrong. > > ...and, before someone comes back with "but all of the upper half of Latin-1 > is represented and has the same values": yes, they do have the same values, > as code points. In UTF-8 they are encoded as 2 octets! A code point has no size. Like universal integers in Ada. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 11:56 ` Luke A. Guest 2021-04-19 12:13 ` Luke A. Guest @ 2021-04-19 12:52 ` Dmitry A. Kazakov 2021-04-19 13:00 ` Luke A. Guest 1 sibling, 1 reply; 63+ messages in thread From: Dmitry A. Kazakov @ 2021-04-19 12:52 UTC (permalink / raw) On 2021-04-19 13:56, Luke A. Guest wrote: > On 19/04/2021 10:08, Stephen Leake wrote: >>> What's the way to manage Unicode correctly? >> >> There are two issues: Unicode in source code, that the compiler must >> understand, and Unicode in strings, that your program must understand. > > And this is where the Ada standard gets it wrong, in the encodings > package re UTF-8. > > Unicode is a superset of 7-bit ASCII, not Latin-1. The high bit in the > leading octet indicates whether there are trailing octets. See > https://github.com/Lucretia/uca/blob/master/src/uca.ads#L70 for the data > layout. The first 128 "characters" in Unicode match those of 7-bit ASCII, > not 8-bit ASCII, and certainly not Latin-1. Therefore this: > > package Ada.Strings.UTF_Encoding > ... > subtype UTF_8_String is String; > ... > end Ada.Strings.UTF_Encoding; > > Was absolutely and totally wrong. It is a practical solution. The Ada type system cannot express differently represented/constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 12:52 ` Dmitry A. Kazakov @ 2021-04-19 13:00 ` Luke A. Guest 2021-04-19 13:10 ` Dmitry A. Kazakov ` (3 more replies) 0 siblings, 4 replies; 63+ messages in thread From: Luke A. Guest @ 2021-04-19 13:00 UTC (permalink / raw) On 19/04/2021 13:52, Dmitry A. Kazakov wrote: > It is a practical solution. The Ada type system cannot express differently represented/constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution. > They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages don't? Oh yeah! Types! ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:00 ` Luke A. Guest @ 2021-04-19 13:10 ` Dmitry A. Kazakov 2021-04-19 13:15 ` Luke A. Guest 0 siblings, 1 reply; 63+ messages in thread From: Dmitry A. Kazakov @ 2021-04-19 13:10 UTC (permalink / raw) On 2021-04-19 14:55, Luke A. Guest wrote: > > On 19/04/2021 13:52, Dmitry A. Kazakov wrote: > >> It is a practical solution. The Ada type system cannot express differently represented/constrained string/array/vector subtypes. Ignoring Latin-1 and using String as if it were an array of octets is the best available solution. >> > > They're different types and should be incompatible, because, well, they are. What does Ada have that allows for this that other languages don't? Oh yeah! Types! They are subtypes, differently constrained, like Positive and Integer. Operations are the same; values are differently constrained. It does not make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It is the same glyph, differently encoded. Encoding is a representation aspect, ergo out of the interface! BTW, a subtype is a type. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:10 ` Dmitry A. Kazakov @ 2021-04-19 13:15 ` Luke A. Guest 2021-04-19 13:31 ` Dmitry A. Kazakov 0 siblings, 1 reply; 63+ messages in thread From: Luke A. Guest @ 2021-04-19 13:15 UTC (permalink / raw) On 19/04/2021 14:10, Dmitry A. Kazakov wrote: >> They're different types and should be incompatible, because, well, > they are. What does Ada have that allows for this that other languages > don't? Oh yeah! Types! > > They are subtypes, differently constrained, like Positive and Integer. No they're not. They're subtypes only and therefore compatible. The UTF string isn't constrained in any other ways. > Operations are the same; values are differently constrained. It does not make > sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It is > the same glyph, differently encoded. Encoding is a representation aspect, > ergo out of the interface! As I already said, the glyph is not part of Unicode. The single code point character concept doesn't exist anymore. > > BTW, a subtype is a type. > A subtype is a compatible type. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:15 ` Luke A. Guest @ 2021-04-19 13:31 ` Dmitry A. Kazakov 2022-04-03 17:24 ` Thomas 0 siblings, 1 reply; 63+ messages in thread From: Dmitry A. Kazakov @ 2021-04-19 13:31 UTC (permalink / raw) On 2021-04-19 15:15, Luke A. Guest wrote: > On 19/04/2021 14:10, Dmitry A. Kazakov wrote: > >>> They're different types and should be incompatible, because, well, >> they are. What does Ada have that allows for this that other languages >> don't? Oh yeah! Types! >> >> They are subtypes, differently constrained, like Positive and Integer. > > No they're not. They're subtypes only and therefore compatible. The UTF > string isn't constrained in any other ways. Of course it is. There could be string encodings that have no Unicode counterparts and are thus missing in UTF-8/16. >> Operations are the same; values are differently constrained. It does not >> make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It >> is the same glyph, differently encoded. Encoding is a representation >> aspect, ergo out of the interface! > > As I already said, the glyph is not part of Unicode. The > single code point character concept doesn't exist anymore. It does not matter from a practical point of view. Some of Unicode's idiosyncrasies are better ignored. >> BTW, a subtype is a type. > A subtype is a compatible type. An Ada subtype is both a sub- and a supertype, i.e. substitutable [or so the compiler thinks] in both directions. A derived tagged type is substitutable in only one direction. Neither is fully "compatible", because otherwise there would be no reason to have an exactly identical thing. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:31 ` Dmitry A. Kazakov @ 2022-04-03 17:24 ` Thomas 0 siblings, 0 replies; 63+ messages in thread From: Thomas @ 2022-04-03 17:24 UTC (permalink / raw) In article <s5k0ne$opv$1@gioia.aioe.org>, "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote: > On 2021-04-19 15:15, Luke A. Guest wrote: > > On 19/04/2021 14:10, Dmitry A. Kazakov wrote: > > > >>> They're different types and should be incompatible, because, well, > >> they are. What does Ada have that allows for this that other languages > >> don't? Oh yeah! Types! > >> > >> They are subtypes, differently constrained, like Positive and Integer. > > > > No they're not. They're subtypes only and therefore compatible. The UTF > > string isn't constrained in any other ways. > > Of course it is. There could be string encodings that have no Unicode > counterparts and are thus missing in UTF-8/16. 1. There is a validity function missing, to tell whether a given UTF_8_String is valid or not, and a Dynamic_Predicate on the subtype UTF_8_String connected to that function (see the sketch below). 2. More important: a valid (non-ASCII) UTF_8_String *does not* represent the same thing as itself converted to String. > >> Operations are the same; values are differently constrained. It does not > >> make sense to consider ASCII 'a', Latin-1 'a', UTF-8 'a' different. It > >> is the same glyph, differently encoded. Encoding is a representation > >> aspect, ergo out of the interface! It works because 'a' is ASCII. If you try it with a non-ASCII character, it all goes wrong. -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
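A sketch of point 1 (all names are hypothetical; a complete check would also have to reject overlong encodings, UTF-16 surrogates, and values above 16#10FFFF#):

   -- a validity function plus a predicated subtype, as asked for above
   package UTF_8_Checked is
      function Is_Valid_UTF_8 (S : String) return Boolean;
      subtype Valid_UTF_8_String is String
        with Dynamic_Predicate => Is_Valid_UTF_8 (Valid_UTF_8_String);
   end UTF_8_Checked;

   package body UTF_8_Checked is
      function Is_Valid_UTF_8 (S : String) return Boolean is
         I      : Natural := S'First;
         Follow : Natural;
      begin
         while I <= S'Last loop
            declare
               O : constant Natural := Character'Pos (S (I));
            begin
               if    O < 16#80#            then Follow := 0;
               elsif O in 16#C2# .. 16#DF# then Follow := 1;
               elsif O in 16#E0# .. 16#EF# then Follow := 2;
               elsif O in 16#F0# .. 16#F4# then Follow := 3;
               else return False;  -- invalid lead or stray continuation octet
               end if;
               if I + Follow > S'Last then
                  return False;    -- truncated sequence at end of string
               end if;
               for J in I + 1 .. I + Follow loop
                  if Character'Pos (S (J)) not in 16#80# .. 16#BF# then
                     return False; -- malformed continuation octet
                  end if;
               end loop;
               I := I + Follow + 1;
            end;
         end loop;
         return True;  -- the empty string is valid, as required
      end Is_Valid_UTF_8;
   end UTF_8_Checked;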
* Re: Ada and Unicode 2021-04-19 13:00 ` Luke A. Guest 2021-04-19 13:10 ` Dmitry A. Kazakov @ 2021-04-19 13:24 ` J-P. Rosen 2021-04-20 19:13 ` Randy Brukardt 2022-04-03 18:04 ` Thomas 2 siblings, 2 replies; 63+ messages in thread From: J-P. Rosen @ 2021-04-19 13:24 UTC (permalink / raw) On 19/04/2021 15:00, Luke A. Guest wrote: > They're different types and should be incompatible, because, well, they > are. What does Ada have that allows for this that other languages > don't? Oh yeah! Types! They are not so different. For example, you may read the first line of a file into a string, then discover that it starts with a BOM, and thus decide it is UTF-8. BTW, the very first version of this AI had different types, but the ARG felt that it would just complicate the interface for the sake of abusive "purity". -- J-P. Rosen Adalog 2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX Tel: +33 1 45 29 21 52 https://www.adalog.fr ^ permalink raw reply [flat|nested] 63+ messages in thread
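The BOM scenario is directly supported by the standard library, which declares the constant BOM_8 in Ada.Strings.UTF_Encoding; a sketch (the file name is hypothetical and error handling is omitted):

   -- bom_demo.adb: read a first line as a plain String, then decide
   with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
   with Ada.Text_IO;              use Ada.Text_IO;

   procedure BOM_Demo is
      F : File_Type;
   begin
      Open (F, In_File, "some_file.txt");
      declare
         Line : constant String := Get_Line (F);
      begin
         if Line'Length >= BOM_8'Length
           and then Line (Line'First .. Line'First + BOM_8'Length - 1) = BOM_8
         then
            Put_Line ("UTF-8 BOM found: treat the rest of the file as UTF-8");
         else
            Put_Line ("no BOM: the encoding must be decided some other way");
         end if;
      end;
      Close (F);
   end BOM_Demo;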
* Re: Ada and Unicode 2021-04-19 13:24 ` J-P. Rosen @ 2021-04-20 19:13 ` Randy Brukardt 0 siblings, 0 replies; 63+ messages in thread From: Randy Brukardt @ 2021-04-20 19:13 UTC (permalink / raw) "J-P. Rosen" <rosen@adalog.fr> wrote in message news:s5k0ai$bb5$1@dont-email.me... > On 19/04/2021 15:00, Luke A. Guest wrote: >> They're different types and should be incompatible, because, well, they >> are. What does Ada have that allows for this that other languages >> don't? Oh yeah! Types! > > They are not so different. For example, you may read the first line of a > file into a string, then discover that it starts with a BOM, and thus decide > it is UTF-8. > > BTW, the very first version of this AI had different types, but the ARG > felt that it would just complicate the interface for the sake of abusive > "purity". Unfortunately, that was the first instance that showed the beginning of the end for Ada. If I remember correctly (and I may not ;-), that came from some people who were wedded to the Linux model where nothing is checked (or IMHO, typed). For them, a String is simply a bucket of octets. That prevented putting an encoding of any sort of any type on file names ("it should just work on Linux, that's what people expect"). The rest follows from that. Those of us who care about strong typing were disgusted, the result essentially does not work on Windows or MacOS (which do check the content of file names - as you can see in GNAT compiling units with non-Latin-1 characters in their names), and I don't really expect any recovery from that. Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:24 ` J-P. Rosen 2021-04-20 19:13 ` Randy Brukardt @ 2022-04-03 18:04 ` Thomas 2022-04-06 18:57 ` J-P. Rosen 1 sibling, 1 reply; 63+ messages in thread From: Thomas @ 2022-04-03 18:04 UTC (permalink / raw) In article <s5k0ai$bb5$1@dont-email.me>, "J-P. Rosen" <rosen@adalog.fr> wrote: > On 19/04/2021 15:00, Luke A. Guest wrote: > > They're different types and should be incompatible, because, well, they > > are. What does Ada have that allows for this that other languages > > don't? Oh yeah! Types! > > They are not so different. For example, you may read the first line of a > file into a string, then discover that it starts with a BOM, and thus > decide it is UTF-8. Could you give me an example of something that you can do now, but could not do if UTF_8_String were private, please? (To discover that it starts with a BOM, you must look at it.) > > BTW, the very first version of this AI had different types, but the ARG > felt that it would just complicate the interface for the sake of abusive > "purity". Could you explain "abusive purity", please? I guess it is because of ASCII. I guess a lot of developers use only ASCII in a lot of situations, and they would find it annoying to need Ada.Strings.UTF_Encoding.Strings every time. But I think a simple explicit conversion is acceptable for a not fully compatible type which requires some attention. The best would be to be required to use ASCII_String as an intermediate, but I don't know how it could be designed at the language level: UTF_8_Var := UTF_8_String (ASCII_String (Latin_1_Var)); Latin_1_Var := String (ASCII_String (UTF_8_Var)); and this would be forbidden: UTF_8_Var := UTF_8_String (Latin_1_Var); This would ensure that Constraint_Error is raised when there are some non-ASCII characters. -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-03 18:04 ` Thomas @ 2022-04-06 18:57 ` J-P. Rosen 2022-04-07 1:30 ` Randy Brukardt 0 siblings, 1 reply; 63+ messages in thread From: J-P. Rosen @ 2022-04-06 18:57 UTC (permalink / raw) On 03/04/2022 21:04, Thomas wrote: >> They are not so different. For example, you may read the first line of a >> file into a string, then discover that it starts with a BOM, and thus >> decide it is UTF-8. > > Could you give me an example of something that you can do now, but could > not do if UTF_8_String were private, please? > (To discover that it starts with a BOM, you must look at it.) Just what I said above, since a BOM is not valid UTF-8 (otherwise, it could not be recognized). >> >> BTW, the very first version of this AI had different types, but the ARG >> felt that it would just complicate the interface for the sake of abusive >> "purity". > > Could you explain "abusive purity", please? > It was felt that in practice, being too strict in separating the types would make things more difficult, without any practical gain. This has been discussed - you may not agree with the outcome, but it was not made out of pure laziness. -- J-P. Rosen Adalog 2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX Tel: +33 1 45 29 21 52 https://www.adalog.fr ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-06 18:57 ` J-P. Rosen @ 2022-04-07 1:30 ` Randy Brukardt 2022-04-08 8:56 ` Simon Wright 0 siblings, 1 reply; 63+ messages in thread From: Randy Brukardt @ 2022-04-07 1:30 UTC (permalink / raw) "J-P. Rosen" <rosen@adalog.fr> wrote in message news:t2knpr$s26$1@dont-email.me... ... > It was felt that in practice, being too strict in separating the types > would make things more difficult, without any practical gain. This has > been discussed - you may not agree with the outcome, but it was not made > out of pure laziness. The problem with that, of course, is that it sends the wrong message vis-a-vis strong typing and interfaces. If we abandon it at the first sign of trouble, then we are saying that it isn't really that important. In this particular case, the reason really came down to practicality: if you want to do anything string-like with a UTF-8 string, making it a separate type becomes painful. It wouldn't work with anything in Ada.Strings, Ada.Text_IO, or Ada.Directories, even though most of the operations are fine. And there was no political will to replace all of those things with versions to use with proper universal strings. Moreover, if you really want to do that, you have to hide much of the array behavior of the Universal string. For instance, you can't allow willy-nilly slicing or replacement: cutting a character representation in half or setting an illegal representation has to be prohibited (operations that would turn a valid string into an invalid string should always raise an exception). That means you can't (directly) use built-in indexing and slicing -- those have to go through some sort of functions. So you do pretty much have to use a private type for universal strings (similar to Ada.Strings.Bounded would be best, I think). If you had an Ada-like language that used a universal UTF-8 string internally, you then would have a lot of old and mostly useless operations supported for array types (since things like slices are mainly useful for string operations). So such a language should simplify the core substantially by dropping many of those obsolete features (especially as little of the library would be directly compatible anyway). So one should end up with a new language that draws from Ada rather than something in Ada itself. (It would be great if that language could make strings with different capacities interoperable - a major annoyance with Ada. And modernizing access types, generalizing resolution, and the like also would be good improvements IMHO.) Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
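A spec-only sketch of the shape described above (all names are hypothetical; the point is that clients see whole code points while the octets stay private):

   -- universal_strings.ads: a private UTF-8 string whose interface deals
   -- in code points, so slicing/replacement can never split a sequence
   with Ada.Strings.Unbounded;
   with Ada.Strings.UTF_Encoding;

   package Universal_Strings is
      type Universal_String is private;
      type Cursor is private;  -- an octet index under the covers

      function To_Universal
        (Item : Ada.Strings.UTF_Encoding.UTF_8_String) return Universal_String;
      function To_UTF_8
        (Item : Universal_String) return Ada.Strings.UTF_Encoding.UTF_8_String;

      function First (Item : Universal_String) return Cursor;
      function Has_Element
        (Item : Universal_String; Position : Cursor) return Boolean;
      function Element
        (Item : Universal_String; Position : Cursor) return Wide_Wide_Character;
      function Next (Item : Universal_String; Position : Cursor) return Cursor;

   private
      type Universal_String is record
         Octets : Ada.Strings.Unbounded.Unbounded_String;  -- UTF-8 inside
      end record;
      type Cursor is record
         Index : Natural := 0;  -- octet position of a lead octet
      end record;
   end Universal_Strings;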
* Re: Ada and Unicode 2022-04-07 1:30 ` Randy Brukardt @ 2022-04-08 8:56 ` Simon Wright 2022-04-08 9:26 ` Dmitry A. Kazakov 0 siblings, 1 reply; 63+ messages in thread From: Simon Wright @ 2022-04-08 8:56 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > If you had an Ada-like language that used a universal UTF-8 string > internally, you then would have a lot of old and mostly useless > operations supported for array types (since things like slices are > mainly useful for string operations). Just off the top of my head, wouldn't it be better to use UTF32-encoded Wide_Wide_Character internally? (you would still have trouble with e.g. national flag emojis :) ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-08 8:56 ` Simon Wright @ 2022-04-08 9:26 ` Dmitry A. Kazakov 2022-04-08 19:19 ` Simon Wright 0 siblings, 1 reply; 63+ messages in thread From: Dmitry A. Kazakov @ 2022-04-08 9:26 UTC (permalink / raw) On 2022-04-08 10:56, Simon Wright wrote: > "Randy Brukardt" <randy@rrsoftware.com> writes: > >> If you had an Ada-like language that used a universal UTF-8 string >> internally, you then would have a lot of old and mostly useless >> operations supported for array types (since things like slices are >> mainly useful for string operations). > > Just off the top of my head, wouldn't it be better to use UTF32-encoded > Wide_Wide_Character internally? Yep, that is exactly the problem: a confusion between interface and implementation. Encoding /= interface, e.g. an interface of a string viewed as an array of characters. That interface is just the same for ASCII, Latin-1, EBCDIC, RADIX50, UTF-8 etc. strings. Why do you care what is inside? The Ada type system's inability to implement this interface is another issue. The usefulness of this interface is yet another. For immutable strings it is quite useful. For mutable strings it might appear too constrained, e.g. for packed encodings like UTF-8 and UTF-16. Also, this interface should have nothing to do with the interface of a UTF-8 string as an array of octets, or the interface of a UTF-16LE string as an array of little-endian words. Since Ada cannot separate these interfaces, for practical purposes Strings are arrays of octets considered as UTF-8 encoding. The rest goes into coding guidelines under the title "never ever do this." -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-08 9:26 ` Dmitry A. Kazakov @ 2022-04-08 19:19 ` Simon Wright 2022-04-08 19:45 ` Dmitry A. Kazakov 0 siblings, 1 reply; 63+ messages in thread From: Simon Wright @ 2022-04-08 19:19 UTC (permalink / raw) "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > On 2022-04-08 10:56, Simon Wright wrote: >> "Randy Brukardt" <randy@rrsoftware.com> writes: >> >>> If you had an Ada-like language that used a universal UTF-8 string >>> internally, you then would have a lot of old and mostly useless >>> operations supported for array types (since things like slices are >>> mainly useful for string operations). >> >> Just off the top of my head, wouldn't it be better to use >> UTF32-encoded Wide_Wide_Character internally? > > Yep, that is exactly the problem: a confusion between interface > and implementation. Don't understand. My point was that *when you are implementing this* it might be easier to deal with 32-bit characters/code points/whatever the proper jargon is than with UTF-8. > Encoding /= interface, e.g. an interface of a string viewed as an > array of characters. That interface is just the same for ASCII, Latin-1, > EBCDIC, RADIX50, UTF-8 etc. strings. Why do you care what is inside? With a user's hat on, I don't. Implementers might have a different point of view. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-08 19:19 ` Simon Wright @ 2022-04-08 19:45 ` Dmitry A. Kazakov 2022-04-09 4:05 ` Randy Brukardt 0 siblings, 1 reply; 63+ messages in thread From: Dmitry A. Kazakov @ 2022-04-08 19:45 UTC (permalink / raw) On 2022-04-08 21:19, Simon Wright wrote: > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: > >> On 2022-04-08 10:56, Simon Wright wrote: >>> "Randy Brukardt" <randy@rrsoftware.com> writes: >>> >>>> If you had an Ada-like language that used a universal UTF-8 string >>>> internally, you then would have a lot of old and mostly useless >>>> operations supported for array types (since things like slices are >>>> mainly useful for string operations). >>> >>> Just off the top of my head, wouldn't it be better to use >>> UTF32-encoded Wide_Wide_Character internally? >> >> Yep, that is exactly the problem: a confusion between interface >> and implementation. > > Don't understand. My point was that *when you are implementing this* it > might be easier to deal with 32-bit characters/code points/whatever the > proper jargon is than with UTF-8. I think it would be more difficult, because you will have to convert from and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto interface standard and I/O standard. That would be 60-70% of all cases you need a string. Most string operations like search, comparison, slicing are isomorphic between code points and octets. So you would win nothing from keeping strings internally as arrays of code points. The situation is comparable to Unbounded_Strings. The implementation is relatively simple, but the user must carry the burden of calling To_String and To_Unbounded_String all over the application and the processor must suffer the overhead of copying arrays here and there. >> Encoding /= interface, e.g. an interface of a string viewed as an >> array of characters. That interface is just the same for ASCII, Latin-1, >> EBCDIC, RADIX50, UTF-8 etc. strings. Why do you care what is inside? > > With a user's hat on, I don't. Implementers might have a different point > of view. Sure, but in the Ada philosophy their opinion should carry less weight than, say, in C. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-08 19:45 ` Dmitry A. Kazakov @ 2022-04-09 4:05 ` Randy Brukardt 2022-04-09 7:43 ` Simon Wright 2022-04-09 10:27 ` DrPi 0 siblings, 2 replies; 63+ messages in thread From: Randy Brukardt @ 2022-04-09 4:05 UTC (permalink / raw) "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message news:t2q3cb$bbt$1@gioia.aioe.org... > On 2022-04-08 21:19, Simon Wright wrote: >> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: >> >>> On 2022-04-08 10:56, Simon Wright wrote: >>>> "Randy Brukardt" <randy@rrsoftware.com> writes: >>>> >>>>> If you had an Ada-like language that used a universal UTF-8 string >>>>> internally, you then would have a lot of old and mostly useless >>>>> operations supported for array types (since things like slices are >>>>> mainly useful for string operations). >>>> >>>> Just off the top of my head, wouldn't it be better to use >>>> UTF32-encoded Wide_Wide_Character internally? >>> >>> Yep, that is exactly the problem: a confusion between interface >>> and implementation. >> >> Don't understand. My point was that *when you are implementing this* it >> might be easier to deal with 32-bit characters/code points/whatever the >> proper jargon is than with UTF-8. > > I think it would be more difficult, because you will have to convert from > and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto interface > standard and I/O standard. That would be 60-70% of all cases you need a > string. Most string operations like search, comparison, slicing are > isomorphic between code points and octets. So you would win nothing from > keeping strings internally as arrays of code points. I basically agree with Dmitry here. The internal representation is an implementation detail, but it seems likely that you would want to store UTF-8 strings directly; they're almost always going to be half the size (even for languages using their own characters like Greek) and for most of us, they'll be just a bit more than a quarter the size. The amount of bytes you copy around matters; the number of operations where code points are needed is fairly small. The main problem with UTF-8 is representing the code point positions in a way that they (a) aren't abused and (b) don't cost too much to calculate. Just using character indexes is too expensive for UTF-8 and UTF-16 representations, and using octet indexes is unsafe (since splitting a character representation is a possibility). I'd probably use an abstract character position type that was implemented with an octet index under the covers. I think that would work OK, as doing math on those is suspicious with a UTF representation. We're spoiled from using Latin-1 representations, of course, but generally one is interested in 5 characters, not 5 octets. And the number of octets in 5 characters depends on the string. So most of the sorts of operations that I tend to do (for instance from some code I was fixing earlier today): if Font'Length > 6 and then Font(2..6) = "Arial" then This would be a bad idea if one is using any sort of universal representation -- you don't know how many octets are in the string literal so you can't assume a number in the test string. So the slice is dangerous (even though in this particular case it would be OK since the test string is all Ascii characters -- but I wouldn't want users to get in the habit of assuming such things). [BTW, the above was a bad idea anyway, because it turns out that the function in the Ada library returned bounds that don't start at 1. So the slice was usually out of range -- which is why I was looking at the code. Another thing that we could do without. Slices are evil, since they *seem* to be the right solution, yet rarely are in practice without a lot of hoops.] > The situation is comparable to Unbounded_Strings. The implementation is > relatively simple, but the user must carry the burden of calling To_String > and To_Unbounded_String all over the application and the processor must > suffer the overhead of copying arrays here and there. Yes, but that happens because Ada doesn't really have a string abstraction, so when you try to build one, you can't fully do the job. One presumes that a new language with a universal UTF-8 string wouldn't have that problem. (As previously noted, I don't see much point in trying to patch up Ada with a bunch of UTF-8 string packages; you would need an entire new set of Ada.Strings libraries and I/O libraries, and then you'd have all of the old stuff messing up resolution, using the best names, and confusing everything. A cleaner slate is needed.) Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
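The bounds pitfall in Randy's aside is easy to reproduce; a sketch:

   -- slice_trap.adb: why Font (2 .. 6) = "Arial" fails when the returned
   -- String's bounds do not start at 1
   procedure Slice_Trap is
      Font : constant String (11 .. 20) := "Arial Bold";  -- bounds 11 .. 20
   begin
      -- Font (2 .. 6) would raise Constraint_Error: 2 is below Font'First.
      -- Writing the test relative to 'First avoids the trap:
      if Font'Length >= 5
        and then Font (Font'First .. Font'First + 4) = "Arial"
      then
         null;  -- matched
      end if;
   end Slice_Trap;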
* Re: Ada and Unicode 2022-04-09 4:05 ` Randy Brukardt @ 2022-04-09 7:43 ` Simon Wright 0 siblings, 0 replies; 63+ messages in thread From: Simon Wright @ 2022-04-09 7:43 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message > news:t2q3cb$bbt$1@gioia.aioe.org... >> On 2022-04-08 21:19, Simon Wright wrote: >>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: >>> >>>> On 2022-04-08 10:56, Simon Wright wrote: >>>>> "Randy Brukardt" <randy@rrsoftware.com> writes: >>>>> >>>>>> If you had an Ada-like language that used a universal UTF-8 string >>>>>> internally, you then would have a lot of old and mostly useless >>>>>> operations supported for array types (since things like slices are >>>>>> mainly useful for string operations). >>>>> >>>>> Just off the top of my head, wouldn't it be better to use >>>>> UTF32-encoded Wide_Wide_Character internally? >>>> >>>> Yep, that is exactly the problem: a confusion between interface >>>> and implementation. >>> >>> Don't understand. My point was that *when you are implementing this* it >>> might be easier to deal with 32-bit characters/code points/whatever the >>> proper jargon is than with UTF-8. >> >> I think it would be more difficult, because you will have to convert from >> and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto interface >> standard and I/O standard. That would be 60-70% of all cases you need a >> string. Most string operations like search, comparison, slicing are >> isomorphic between code points and octets. So you would win nothing from >> keeping strings internally as arrays of code points. > > I basically agree with Dmitry here. The internal representation is an > implementation detail, but it seems likely that you would want to store > UTF-8 strings directly; they're almost always going to be half the size > (even for languages using their own characters like Greek) and for most of > us, they'll be just a bit more than a quarter the size. The amount of bytes > you copy around matters; the number of operations where code points are > needed is fairly small. Well, I don't have any skin in this game, so I'll shut up at this point. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-09 4:05 ` Randy Brukardt 2022-04-09 7:43 ` Simon Wright @ 2022-04-09 10:27 ` DrPi 2022-04-09 16:46 ` Dennis Lee Bieber 2022-04-10 5:58 ` Vadim Godunko 1 sibling, 2 replies; 63+ messages in thread From: DrPi @ 2022-04-09 10:27 UTC (permalink / raw) Le 09/04/2022 à 06:05, Randy Brukardt a écrit : > "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message > news:t2q3cb$bbt$1@gioia.aioe.org... >> On 2022-04-08 21:19, Simon Wright wrote: >>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes: >>> >>>> On 2022-04-08 10:56, Simon Wright wrote: >>>>> "Randy Brukardt" <randy@rrsoftware.com> writes: >>>>> >>>>>> If you had an Ada-like language that used a universal UTF-8 string >>>>>> internally, you then would have a lot of old and mostly useless >>>>>> operations supported for array types (since things like slices are >>>>>> mainly useful for string operations). >>>>> >>>>> Just off the top of my head, wouldn't it be better to use >>>>> UTF32-encoded Wide_Wide_Character internally? >>>> >>>> Yep, that is exactly the problem, a confusion between interface >>>> and implementation. >>> >>> Don't understand. My point was that *when you are implementing this* it >>> might be easier to deal with 32-bit characters/code points/whatever the >>> proper jargon is than with UTF8. >> >> I think it would be more difficult, because you will have to convert from >> and to UTF-8 under the hood or explicitly. UTF-8 is the de-facto interface >> and I/O standard. That would be 60-70% of all cases where you need a >> string. Most string operations like search, comparison, slicing are >> isomorphic between code points and octets. So you would win nothing from >> keeping strings internally as arrays of code points. > > I basically agree with Dmitry here. The internal representation is an > implementation detail, but it seems likely that you would want to store > UTF-8 strings directly; they're almost always going to be half the size > (even for languages using their own characters like Greek) and for most of > us, they'll be just a bit more than a quarter the size. The number of bytes > you copy around matters; the number of operations where code points are > needed is fairly small. > > The main problem with UTF-8 is representing the code point positions in a > way that they (a) aren't abused and (b) don't cost too much to calculate. > Just using character indexes is too expensive for UTF-8 and UTF-16 > representations, and using octet indexes is unsafe (since splitting a > character representation is a possibility). I'd probably use an abstract > character position type that was implemented with an octet index under the > covers. > > I think that would work OK, as doing math on those is suspicious with a UTF > representation. We're spoiled from using Latin-1 representations, of course, > but generally one is interested in 5 characters, not 5 octets. And the > number of octets in 5 characters depends on the string. So most of the sorts > of operations that I tend to do (for instance from some code I was fixing > earlier today): > > if Font'Length > 6 and then > Font(2..6) = "Arial" then > > This would be a bad idea if one is using any sort of universal > representation -- you don't know how many octets are in the string literal so > you can't assume a number in the test string.
So the slice is dangerous > (even though in this particular case it would be OK since the test string is > all ASCII characters -- but I wouldn't want users to get in the habit of > assuming such things). > > [BTW, the above was a bad idea anyway, because it turns out that the > function in the Ada library returned bounds that don't start at 1. So the > slice was usually out of range -- which is why I was looking at the code. > Another thing that we could do without. Slices are evil, since they *seem* > to be the right solution, yet rarely are in practice without a lot of > hoops.] > >> The situation is comparable to Unbounded_Strings. The implementation is >> relatively simple, but the user must carry the burden of calling To_String >> and To_Unbounded_String all over the application and the processor must >> suffer the overhead of copying arrays here and there. > > Yes, but that happens because Ada doesn't really have a string abstraction, > so when you try to build one, you can't fully do the job. One presumes that > a new language with a universal UTF-8 string wouldn't have that problem. (As > previously noted, I don't see much point in trying to patch up Ada with a > bunch of UTF-8 string packages; you would need an entire new set of > Ada.Strings libraries and I/O libraries, and then you'd have all of the old > stuff messing up resolution, using the best names, and confusing everything. > A cleaner slate is needed.) > > Randy. > > In Python-2, there is the same kind of problem. A string is a byte array. It is the programmer's responsibility to encode/decode to/from UTF-8/Latin-1/... and to manage everything correctly. Literal strings can be considered as encoded or decoded depending on the notation ("" or u""). In Python-3, a string is a character (glyph?) array. The internal representation is hidden from the programmer. UTF-8/Latin-1/... encoded "strings" are of type bytes (byte array). Writing/reading to/from a file is done with the bytes type. When writing/reading to/from a file in text mode, you have to specify the encoding to use. The encoding/decoding is then internally managed. As a general rule, all "external communications" are done with bytes (byte array). It is the programmer's responsibility to encode/decode where needed to convert from/to strings. The source files (.py) are considered to be UTF-8 encoded by default, but one can declare the actual encoding at the top of the file in a special comment tag. When a badly encoded character is found, an exception is raised at parsing time. So, literal strings are real strings, not bytes. I think the Python-3 way of doing things is much more understandable and really usable. On the Ada side, I've still not understood how to correctly deal with all this stuff. Note: In Python-3, the bytes type is not reserved for encoded "strings". It is a versatile type for what its name says: a byte array. ^ permalink raw reply [flat|nested] 63+ messages in thread
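A minimal Ada sketch of the same discipline DrPi describes (decode once at the input boundary, encode once at the output boundary), using only the standard Ada.Strings.UTF_Encoding packages; the octet sequence is built with Encode here so the source file itself stays ASCII:

   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   with Ada.Wide_Wide_Text_IO;

   procedure Boundary_Demo is
      package UTF  renames Ada.Strings.UTF_Encoding;
      package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

      --  The "bytes" side: a UTF-8 octet sequence, as it would arrive
      --  from a file or socket.
      Raw : constant UTF.UTF_8_String :=
        Conv.Encode ("alpha = " & Wide_Wide_Character'Val (16#03B1#));

      --  The "str" side: decoded text, one element per code point.
      Text : constant Wide_Wide_String := Conv.Decode (Raw);
   begin
      Ada.Wide_Wide_Text_IO.Put_Line (Text);
      --  How standard output re-encodes this depends on the runtime;
      --  with GNAT, -gnatW8 selects UTF-8.
   end Boundary_Demo;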
* Re: Ada and Unicode 2022-04-09 10:27 ` DrPi @ 2022-04-09 16:46 ` Dennis Lee Bieber 2022-04-09 18:59 ` DrPi 0 siblings, 1 reply; 63+ messages in thread From: Dennis Lee Bieber @ 2022-04-09 16:46 UTC (permalink / raw) On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the following: > >In Python-3, a string is a character (glyph?) array. The internal >representation is hidden from the programmer. <SNIP> > >On the Ada side, I've still not understood how to correctly deal with >all this stuff. One thing to take into account is that Python strings are immutable. Changing the contents of a string requires constructing a new string from parts that incorporate the change. That allows for the second aspect -- even if not visible to a programmer, Python (3) strings are not a fixed representation: If all characters in the string fit in the 8-bit UTF range, that string is stored using one byte per character. If any character uses a 16-bit UTF representation, the entire string is stored as 16-bit characters (and similar for 32-bit UTF points). Thus, indexing into the string is still fast -- just needing to scale the index by the character width of the entire string. -- Wulfraed Dennis Lee Bieber AF6VN wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/ ^ permalink raw reply [flat|nested] 63+ messages in thread
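For what it's worth, that "widest element actually needed" storage scheme can be mimicked in Ada with a discriminated record; a purely illustrative sketch (this is not how any Ada runtime stores strings):

   package Flexible_Strings is
      type Code_Width is (Narrow, Medium, Wide);
      type Unit_8  is mod 2**8;
      type Unit_16 is mod 2**16;
      type Unit_32 is mod 2**32;
      type Units_8  is array (Positive range <>) of Unit_8;
      type Units_16 is array (Positive range <>) of Unit_16;
      type Units_32 is array (Positive range <>) of Unit_32;

      type Flex_String (Width : Code_Width; Length : Natural) is record
         case Width is
            when Narrow => N : Units_8  (1 .. Length);
            when Medium => M : Units_16 (1 .. Length);
            when Wide   => W : Units_32 (1 .. Length);
         end case;
      end record;

      --  Indexing stays O(1): only the scale factor depends on Width.
      function Element (S : Flex_String; Index : Positive)
        return Wide_Wide_Character is
        (case S.Width is
            when Narrow => Wide_Wide_Character'Val (S.N (Index)),
            when Medium => Wide_Wide_Character'Val (S.M (Index)),
            when Wide   => Wide_Wide_Character'Val (S.W (Index)));
   end Flexible_Strings;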
* Re: Ada and Unicode 2022-04-09 16:46 ` Dennis Lee Bieber @ 2022-04-09 18:59 ` DrPi 0 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2022-04-09 18:59 UTC (permalink / raw) Le 09/04/2022 à 18:46, Dennis Lee Bieber a écrit : > On Sat, 9 Apr 2022 12:27:04 +0200, DrPi <314@drpi.fr> declaimed the > following: > >> >> In Python-3, a string is a character (glyph?) array. The internal >> representation is hidden from the programmer. > > <SNIP> >> >> On the Ada side, I've still not understood how to correctly deal with >> all this stuff. > > One thing to take into account is that Python strings are immutable. > Changing the contents of a string requires constructing a new string from > parts that incorporate the change. > Right. I forgot to mention it. > That allows for the second aspect -- even if not visible to a > programmer, Python (3) strings are not a fixed representation: If all > characters in the string fit in the 8-bit UTF range, that string is stored > using one byte per character. If any character uses a 16-bit UTF > representation, the entire string is stored as 16-bit characters (and > similar for 32-bit UTF points). Thus, indexing into the string is still > fast -- just needing to scale the index by the character width of the > entire string. > Thanks for clarifying. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-09 10:27 ` DrPi 2022-04-09 16:46 ` Dennis Lee Bieber @ 2022-04-10 5:58 ` Vadim Godunko 2022-04-10 18:59 ` DrPi 2022-04-12 6:13 ` Randy Brukardt 1 sibling, 2 replies; 63+ messages in thread From: Vadim Godunko @ 2022-04-10 5:58 UTC (permalink / raw) On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote: > > On the Ada side, I've still not understood how to correctly deal with > all this stuff. > Take a look at https://github.com/AdaCore/VSS The ideas behind this library are close to the ideas of type separation in Python3. A string is a Virtual_String, a byte sequence is a Stream_Element_Vector. To convert a byte stream to a string or back, use Virtual_String_Encoder/Virtual_String_Decoder. I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; cleaner types and APIs are a requirement now. The only case where the old character/string types really make sense is low-resource embedded systems; in other cases their use generates a lot of hidden issues, which are very hard to detect. ^ permalink raw reply [flat|nested] 63+ messages in thread
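A hedged sketch of what entering the VSS world looks like; a To_Virtual_String conversion from a UTF-8 encoded Standard.String is assumed here based on the library's conversions package, so check the actual VSS sources before relying on the exact unit names and profiles:

   with VSS.Strings;
   with VSS.Strings.Conversions;

   procedure VSS_Sketch is
      --  All further processing stays in the encoding-independent
      --  Virtual_String API; no octets are visible at this level.
      Text : constant VSS.Strings.Virtual_String :=
        VSS.Strings.Conversions.To_Virtual_String ("plain ASCII example");
   begin
      null;
   end VSS_Sketch;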
* Re: Ada and Unicode 2022-04-10 5:58 ` Vadim Godunko @ 2022-04-10 18:59 ` DrPi 2022-04-12 6:13 ` Randy Brukardt 1 sibling, 0 replies; 63+ messages in thread From: DrPi @ 2022-04-10 18:59 UTC (permalink / raw) Le 10/04/2022 à 07:58, Vadim Godunko a écrit : > On Saturday, April 9, 2022 at 1:27:08 PM UTC+3, DrPi wrote: >> >> On the Ada side, I've still not understood how to correctly deal with >> all this stuff. >> > Take a look at https://github.com/AdaCore/VSS > > Ideas behind this library is close to ideas of types separation in Python3. String is a Virtual_String, byte sequence is Stream_Element_Vector. Need to convert byte stream to string or back - use Virtual_String_Encoder/Virtual_String_Decoder. > > I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and programming languages; more cleaner types and API is a requirement now. The only case when old character/string types is really makes value is low resources embedded systems; in other cases their use generates a lot of hidden issues, which is very hard to detect. That's an interesting solution. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2022-04-10 5:58 ` Vadim Godunko 2022-04-10 18:59 ` DrPi @ 2022-04-12 6:13 ` Randy Brukardt 1 sibling, 0 replies; 63+ messages in thread From: Randy Brukardt @ 2022-04-12 6:13 UTC (permalink / raw) "Vadim Godunko" <vgodunko@gmail.com> wrote in message news:3962d55d-10e8-4dff-9ad3-847d69c3c337n@googlegroups.com... ... >I think ((Wide_)Wide_)(Character|String) is obsolete for modern systems and >programming languages; cleaner types and APIs are a requirement now. ...which essentially means Ada is obsolete in your view, as String in particular is way too embedded in the definition and the language-defined units to use anything else. You'd end up with a mass of conversions to get anything done (the main problem with Ada.Strings.Unbounded). Or I suppose you could replace pretty much the entire library with a new one. But now you have two of everything to confuse newcomers and you still have a mass of old nonsense weighing down the language and complicating implementations. >The only case where the old character/string types really make sense is >low-resource embedded systems; ... ...which of course is at least 50% of the use of Ada, and probably closer to 90% of the money. Any solution for Ada has to continue to meet the needs of embedded programmers. For instance, it would need to support fixed, bounded, and unbounded versions (solely having unbounded strings would not work for many applications, and indeed not just embedded systems need to restrict those -- any long-running server has to control dynamic allocation). >...in other cases their use generates a lot of hidden issues, which are very >hard to detect. At least some of which occur because a string is not an array, and the forcible mapping to them never worked very well. The Z-80 Pascals that we used to implement the very earliest versions of Ada had more functional strings than Ada does (by being bounded and using a library for most operations) - they would have been way easier to extend (as the Python ones were, as an example). Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
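The Ada.Strings.Unbounded conversion problem Randy mentions, in miniature; every crossing between the Unbounded_String world and the String world needs an explicit call:

   with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
   with Ada.Text_IO;

   procedure Conversion_Burden is
      S : Unbounded_String := To_Unbounded_String ("Hello");
   begin
      Append (S, ", world");
      --  ...and back again at every String-based API:
      Ada.Text_IO.Put_Line (To_String (S));
   end Conversion_Burden;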
* Re: Ada and Unicode 2021-04-19 13:00 ` Luke A. Guest 2021-04-19 13:10 ` Dmitry A. Kazakov 2021-04-19 13:24 ` J-P. Rosen @ 2021-04-19 16:07 ` DrPi 2021-04-20 19:06 ` Randy Brukardt 3 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 16:07 UTC (permalink / raw) Le 19/04/2021 à 15:00, Luke A. Guest a écrit : > > > On 19/04/2021 13:52, Dmitry A. Kazakov wrote: > >> It is a practical solution. The Ada type system cannot express differently > represented/constrained string/array/vector subtypes. Ignoring Latin-1 > and using String as if it were an array of octets is the best available > solution. >> > > They're different types and should be incompatible, because, well, they > are. What does Ada have that allows for this that other languages > don't? Oh yeah! Types! I agree. In Python2, encoded and "decoded" strings are of the same type "str". Bad design. In Python3, "decoded" strings are of type "str" and encoded strings are of type "bytes" (byte array). Both are different things and can't be assigned one to the other. Much clearer for the programmer. It should be the same in Ada. Different types. ^ permalink raw reply [flat|nested] 63+ messages in thread
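A minimal Ada parallel of that str/bytes separation (illustrative only; the standard libraries do not enforce this discipline by themselves):

   with Ada.Streams;

   procedure Separation_Demo is
      --  "bytes": raw octets, as read from a file or socket.
      Raw  : Ada.Streams.Stream_Element_Array (1 .. 2) := (72, 105);  -- "Hi"
      --  "str": decoded text, one element per code point.
      Text : Wide_Wide_String (1 .. 2) := "Hi";
   begin
      --  Text := Raw;   -- rejected at compile time: unrelated types
      --  Raw  := Text;  -- likewise
      null;
   end Separation_Demo;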
* Re: Ada and Unicode 2021-04-19 13:00 ` Luke A. Guest ` (2 preceding siblings ...) 2021-04-19 16:07 ` DrPi @ 2021-04-20 19:06 ` Randy Brukardt 2022-04-03 18:37 ` Thomas 3 siblings, 1 reply; 63+ messages in thread From: Randy Brukardt @ 2021-04-20 19:06 UTC (permalink / raw) "Luke A. Guest" <laguest@archeia.com> wrote in message news:s5jute$1s08$1@gioia.aioe.org... > > > On 19/04/2021 13:52, Dmitry A. Kazakov wrote: > > > It is a practical solution. The Ada type system cannot express differently > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and > using String as if it were an array of octets is the best available > solution. > > > > They're different types and should be incompatible, because, well, they > are. What does Ada have that allows for this that other languages don't? > Oh yeah! Types! If they're incompatible, you need an automatic way to convert between representations, since these are all views of the same thing (an abstract string type). You really don't want 35 versions of Open each taking a different string type. It's the fact that Ada can't do this that makes Unbounded_Strings unusable (well, barely usable). Ada 202x fixes the literal problem at least, but we'd have to completely abandon Unbounded_Strings and use a different library design in order for it to allow literals. And if you're going to do that, you might as well do something about UTF-8 as well -- but now you're going to need even more conversions. Yuck. I think the only true solution here would be based on a proper abstract Root_String type. But that wouldn't work in Ada, since it would be incompatible with all of the existing code out there. Probably would have to wait for a follow-on language. Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
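For reference, a hedged sketch of the Ada 202x feature Randy alludes to: the String_Literal aspect lets a user-defined type accept string literals, with the compiler handing the literal to a user function as a Wide_Wide_String (the package and function names here are invented; requires an Ada 2022 compiler):

   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   package UTF8_Literals is
      type UTF_8 (<>) is private
        with String_Literal => From_Literal;
      function From_Literal (Text : Wide_Wide_String) return UTF_8;
   private
      type UTF_8 is new Ada.Strings.UTF_Encoding.UTF_8_String;
      function From_Literal (Text : Wide_Wide_String) return UTF_8 is
        (UTF_8 (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text)));
   end UTF8_Literals;

   --  usage: V : UTF8_Literals.UTF_8 := "any Unicode text";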
* Re: Ada and Unicode 2021-04-20 19:06 ` Randy Brukardt @ 2022-04-03 18:37 ` Thomas 2022-04-04 23:52 ` Randy Brukardt 0 siblings, 1 reply; 63+ messages in thread From: Thomas @ 2022-04-03 18:37 UTC (permalink / raw) In article <s5n8nj$cec$1@franka.jacob-sparre.dk>, "Randy Brukardt" <randy@rrsoftware.com> wrote: > "Luke A. Guest" <laguest@archeia.com> wrote in message > news:s5jute$1s08$1@gioia.aioe.org... > > > > > > On 19/04/2021 13:52, Dmitry A. Kazakov wrote: > > > > > It is a practical solution. The Ada type system cannot express differently > > represented/constrained string/array/vector subtypes. Ignoring Latin-1 and > > using String as if it were an array of octets is the best available > > solution. > > > > > > > They're different types and should be incompatible, because, well, they > > are. What does Ada have that allows for this that other languages don't? > > Oh yeah! Types! > > If they're incompatible, you need an automatic way to convert between > representations, since these are all views of the same thing (an abstract > string type). You really don't want 35 versions of Open each taking a > different string type. i don't need 35 versions of Open. i need one version of Open with a Unicode string type (not Latin-1 - preferably UTF-8), which will use Ada.Strings.UTF_Encoding.Conversions as far as needed, depending on the underlying API. > > It's the fact that Ada can't do this that makes Unbounded_Strings unusable > (well, barely usable). knowing Ada, i find it acceptable. i don't say the same about Ada.Strings.UTF_Encoding.UTF_8_String. > Ada 202x fixes the literal problem at least, but we'd > have to completely abandon Unbounded_Strings and use a different library > design in order for it to allow literals. And if you're going to do > that, you might as well do something about UTF-8 as well -- but now you're > going to need even more conversions. Yuck. as i said to Vadim Godunko, i need to fill a string type with a UTF-8 literal. but i don't think this string type has to manage various conversions. from my point of view, each library has to accept 1 kind of string type (preferably UTF-8 everywhere), and then this library has to make the needed conversions regarding the underlying API, not the user. > > I think the only true solution here would be based on a proper abstract > Root_String type. But that wouldn't work in Ada, since it would be > incompatible with all of the existing code out there. Probably would have to > wait for a follow-on language. of course, it would be very nice to have a thicker language with a garbage collector, only 1 String type which allows all that we need, etc. -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
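A sketch of the kind of wrapper described above: one UTF-8 entry point that hides any conversion to the underlying API. The procedure name is invented, and the pass-through assumption only holds on systems whose file names are UTF-8 octet sequences; a port elsewhere would convert inside, e.g. via Ada.Strings.UTF_Encoding.Conversions:

   with Ada.Text_IO;
   with Ada.Strings.UTF_Encoding;

   procedure Open_UTF_8
     (File : in out Ada.Text_IO.File_Type;
      Mode : in     Ada.Text_IO.File_Mode;
      Name : in     Ada.Strings.UTF_Encoding.UTF_8_String)
   is
   begin
      --  UTF_8_String is (today) just a subtype of String, so on a
      --  UTF-8 file system the octets pass straight through; other
      --  platforms would convert here, inside the library.
      Ada.Text_IO.Open (File, Mode, Name);
   end Open_UTF_8;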
* Re: Ada and Unicode 2022-04-03 18:37 ` Thomas @ 2022-04-04 23:52 ` Randy Brukardt 2023-03-31 3:06 ` Thomas 0 siblings, 1 reply; 63+ messages in thread From: Randy Brukardt @ 2022-04-04 23:52 UTC (permalink / raw) "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr... ... > as i said to Vadim Godunko, i need to fill a string type with a UTF-8 > literal. but i don't think this string type has to manage various > conversions. > > from my point of view, each library has to accept 1 kind of string type > (preferably UTF-8 everywhere), > and then this library has to make the needed conversions regarding the > underlying API, not the user. This certainly is a fine ivory tower solution, but it completely ignores two practicalities in the case of Ada: (1) You need to replace almost all of the existing Ada language defined packages to make this work. Things that are deeply embedded in both implementations and programs (like Ada.Exceptions and Ada.Text_IO) would have to change substantially. The result would essentially be a different language, since the resulting libraries would not work with most existing programs. They'd have to have different names (since if you used the same names, you change the failures from compile-time to runtime -- or even undetected -- which would be completely against the spirit of Ada), which means that one would have to essentially start over learning and using the resulting language. Calling it Ada would be rather silly, since it would be practically incompatible (and it would make sense to use this point to eliminate a lot of the cruft from the Ada design). (2) One needs to be able to read and write data given whatever encoding the project requires (that's often decided by outside forces, such as other hardware or software that the project needs to interoperate with). That means that completely hiding the encoding (or using a universal encoding) doesn't fully solve the problems faced by Ada programmers. At a minimum, you have to have a way to specify the encoding of files, streams, and hardware interfaces (this sort of thing is not provided by any common target OS, so it's not in any target API). That will greatly complicate the interface and implementation of the libraries. > ... of course, it would be very nice to have a thicker language with > a garbage collector ... I doubt that you will ever see that in the Ada family, as analysis and therefore determinism is a very important property for the language. Ada has lots of mechanisms for managing storage without directly doing it yourself (by calling Unchecked_Deallocation), yet none of them use any garbage collection in a traditional sense. I could see more such mechanisms (an ownership option along the lines of Rust could easily manage storage at the same time, since any object that could be orphaned could never be used again and thus should be reclaimed), but standard garbage collection is too non-deterministic for many of the uses Ada is put to. Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
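GNAT already gives a flavor of that "specify the encoding of files" requirement through the Form parameter; a sketch of reading a UTF-8 encoded text file (the "WCEM=8" form string is GNAT-specific, not standard Ada, and the file name is illustrative):

   with Ada.Wide_Wide_Text_IO;

   procedure Read_UTF_8_File is
      use Ada.Wide_Wide_Text_IO;
      F : File_Type;
   begin
      Open (F, In_File, "data.txt", Form => "WCEM=8");
      while not End_Of_File (F) loop
         declare
            Line : constant Wide_Wide_String := Get_Line (F);
         begin
            null;  -- process the decoded line here
         end;
      end loop;
      Close (F);
   end Read_UTF_8_File;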
* Re: Ada and Unicode 2022-04-04 23:52 ` Randy Brukardt @ 2023-03-31 3:06 ` Thomas 2023-04-01 10:18 ` Randy Brukardt 0 siblings, 1 reply; 63+ messages in thread From: Thomas @ 2023-03-31 3:06 UTC (permalink / raw) In article <t2g0c1$eou$1@dont-email.me>, "Randy Brukardt" <randy@rrsoftware.com> wrote: > "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message > news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr... > ... > > as i said to Vadim Godunko, i need to fill a string type with a UTF-8 > > literal. but i don't think this string type has to manage various > > conversions. > > > > from my point of view, each library has to accept 1 kind of string type > > (preferably UTF-8 everywhere), > > and then this library has to make the needed conversions regarding the > > underlying API, not the user. > > This certainly is a fine ivory tower solution, I like to think from an ivory tower, and then look at reality to see what's possible and what's not. :-) > but it completely ignores two > practicalities in the case of Ada: > > (1) You need to replace almost all of the existing Ada language defined > packages to make this work. Things that are deeply embedded in both > implementations and programs (like Ada.Exceptions and Ada.Text_IO) would > have to change substantially. The result would essentially be a different > language, since the resulting libraries would not work with most existing > programs. - in Ada, of course we can't delete what's existing, and there are many packages which are already in 3 versions (S/WS/WWS). imho, it would be consistent to make a 4th version of them for a new UTF_8_String type. - in a new language close to Ada, it would not necessarily be a good idea to remove some of them; depending on industrial needs, it may be better to keep them with us. > They'd have to have different names (since if you used the same > names, you change the failures from compile-time to runtime -- or even > undetected -- which would be completely against the spirit of Ada), which > means that one would have to essentially start over learning and using the > resulting language. i think i don't understand. > (and it would make sense to use this point to > eliminate a lot of the cruft from the Ada design). could you give an example of cruft from the Ada design, please? :-) > > (2) One needs to be able to read and write data given whatever encoding the > project requires (that's often decided by outside forces, such as other > hardware or software that the project needs to interoperate with). > At a minimum, you > have to have a way to specify the encoding of files, streams, and hardware > interfaces > That will greatly complicate the interface and > implementation of the libraries. i don't think so. it's a matter of interfacing libraries, for the purpose of communicating with the outside (not of internal libraries, nor of the choice of the internal type for the implementation). Ada.Text_IO.Open.Form already allows (a part of?) this (on the content of the files, not on their name), see ARM A.10.2 (6-8). (did i write the ARM reference correctly?) > > > ... of course, it would be very nice to have a thicker language with > > a garbage collector ... > > I doubt that you will ever see that in the Ada family, > as analysis and > therefore determinism is a very important property for the language.
I completely agree :-) > Ada has > lots of mechanisms for managing storage without directly doing it yourself > (by calling Unchecked_Deallocation), yet none of them use any garbage > collection in a traditional sense. sorry, i meant "garbage collector" in a generic sense, not in a traditional sense. that is, as Ada users we could program with pointers and pools, without memory leaks or calling Unchecked_Deallocation. for example Ada.Containers.Indefinite_Holders. i already wrote one for constrained limited types. do you know if it's possible to do it for unconstrained limited types, like the class of a limited tagged type? -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
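For the nonlimited case the standard library already covers this; a minimal example of Ada.Containers.Indefinite_Holders managing storage with no visible pointers. Note that the generic's formal is "type Element_Type (<>) is private", i.e. nonlimited, which is why a holder for limited types has to be hand-written, as Thomas did:

   with Ada.Containers.Indefinite_Holders;

   procedure Holder_Demo is
      package String_Holders is
        new Ada.Containers.Indefinite_Holders (String);
      use String_Holders;

      H : Holder := To_Holder ("first value");
   begin
      H.Replace_Element ("a replacement of a different length");
      --  storage is reclaimed automatically; no Unchecked_Deallocation
   end Holder_Demo;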
* Re: Ada and Unicode 2023-03-31 3:06 ` Thomas @ 2023-04-01 10:18 ` Randy Brukardt 0 siblings, 0 replies; 63+ messages in thread From: Randy Brukardt @ 2023-04-01 10:18 UTC (permalink / raw) I'm not going to answer this point-by-point, as it would take very much too long, and there is a similar thread going on the ARG's GitHub (which needs my attention more than comp.lang.ada). But my opinion is that Ada got strings completely wrong, and the best thing to do with them is to completely nuke them and start over. But one cannot do that in the context of Ada, one would have to at least leave a way to use the old mechanisms for compatibility with older code. That would leave a hodge-podge of mechanisms that would make Ada very much harder (rather than easier) to use. As far as the cruft goes, I wrote up a 20+ page document on that during the pandemic, but I could never interest anyone knowledgeable in reviewing it, and I don't plan to make it available without that. Most of the things are caused by interactions -- mostly because of too much generality. And of course there are features that Ada would be better off without (like anonymous access types). Randy. "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message news:64264e2f$0$25952$426a74cc@news.free.fr... > In article <t2g0c1$eou$1@dont-email.me>, > "Randy Brukardt" <randy@rrsoftware.com> wrote: > >> "Thomas" <fantome.forums.tDeContes@free.fr.invalid> wrote in message >> news:fantome.forums.tDeContes-5E3B70.20370903042022@news.free.fr... >> ... >> > as i said to Vadim Godunko, i need to fill a string type with a UTF-8 >> > literal. but i don't think this string type has to manage various >> > conversions. >> > >> > from my point of view, each library has to accept 1 kind of string type >> > (preferably UTF-8 everywhere), >> > and then this library has to make the needed conversions regarding the >> > underlying API, not the user. >> >> This certainly is a fine ivory tower solution, > > I like to think from an ivory tower, > and then look at reality to see what's possible and what's not. :-) > > >> but it completely ignores two >> practicalities in the case of Ada: >> >> (1) You need to replace almost all of the existing Ada language defined >> packages to make this work. Things that are deeply embedded in both >> implementations and programs (like Ada.Exceptions and Ada.Text_IO) would >> have to change substantially. The result would essentially be a different >> language, since the resulting libraries would not work with most existing >> programs. > > - in Ada, of course we can't delete what's existing, and there are many > packages which are already in 3 versions (S/WS/WWS). > imho, it would be consistent to make a 4th version of them for a new > UTF_8_String type. > > - in a new language close to Ada, it would not necessarily be a good > idea to remove some of them; depending on industrial needs, it may be > better to keep them with us. > >> They'd have to have different names (since if you used the same >> names, you change the failures from compile-time to runtime -- or even >> undetected -- which would be completely against the spirit of Ada), which >> means that one would have to essentially start over learning and using >> the >> resulting language. > > i think i don't understand. > >> (and it would make sense to use this point to >> eliminate a lot of the cruft from the Ada design). > > could you give an example of cruft from the Ada design, please?
:-) > > >> >> (2) One needs to be able to read and write data given whatever encoding >> the >> project requires (that's often decided by outside forces, such as other >> hardware or software that the project needs to interoperate with). > >> At a minimum, you >> have to have a way to specify the encoding of files, streams, and >> hardware >> interfaces > >> That will greatly complicate the interface and >> implementation of the libraries. > > i don't think so. > it's a matter of interfacing libraries, for the purpose of communicating > with the outside (not of internal libraries, nor of the choice of the > internal type for the implementation). > > Ada.Text_IO.Open.Form already allows (a part of?) this (on the content > of the files, not on their name), see ARM A.10.2 (6-8). > (did i write the ARM reference correctly?) > > > >> >> > ... of course, it would be very nice to have a thicker language >> > with >> > a garbage collector ... >> >> I doubt that you will ever see that in the Ada family, > >> as analysis and >> therefore determinism is a very important property for the language. > > I completely agree :-) > >> Ada has >> lots of mechanisms for managing storage without directly doing it >> yourself >> (by calling Unchecked_Deallocation), yet none of them use any garbage >> collection in a traditional sense. > > sorry, i meant "garbage collector" in a generic sense, not in a > traditional sense. > that is, as Ada users we could program with pointers and pools, without > memory leaks or calling Unchecked_Deallocation. > > for example Ada.Containers.Indefinite_Holders. > > i already wrote one for constrained limited types. > do you know if it's possible to do it for unconstrained limited types, > like the class of a limited tagged type? > > -- > RAPID maintainer > http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 9:08 ` Stephen Leake 2021-04-19 9:34 ` Dmitry A. Kazakov 2021-04-19 11:56 ` Luke A. Guest @ 2021-04-19 16:14 ` DrPi 2021-04-19 17:12 ` Björn Lundin 2022-04-16 2:32 ` Thomas 3 siblings, 1 reply; 63+ messages in thread From: DrPi @ 2021-04-19 16:14 UTC (permalink / raw) Le 19/04/2021 à 11:08, Stephen Leake a écrit : > DrPi <314@drpi.fr> writes: > >> Any way to use source code encoded in UTF-8 ? > > for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8"); > That's interesting. Using these switches at project level is not OK. Project source files do not always use the same encoding. Especially when using libraries. Using these switches at source level is better. A little bit complicated to use, but better. > from the gnat user guide, 4.3.1 Alphabetical List of All Switches: > > `-gnati`c'' > Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w). For details > of the possible selections for `c', see *note Character Set > Control: 4e. > > This applies to identifiers in the source code > > `-gnatW`e'' > Wide character encoding method (`e'=n/h/u/s/e/8). > > This applies to string and character literals. > >> What's the way to manage Unicode correctly ? > > There are two issues: Unicode in source code, that the compiler must > understand, and Unicode in strings, that your program must understand. > > (I've never written a program that dealt with utf strings other than > file names). > > -gnati8 tells the compiler that the source code uses utf-8 encoding. > > -gnatW8 tells the compiler that string literals use utf-8 encoding. > > package Ada.Strings.UTF_Encoding provides some facilities for dealing > with utf. It does _not_ provide walking a string by code point, which > would seem necessary. > > We could be more helpful if you show what you are trying to do, what > you've tried, and what errors you got. > ^ permalink raw reply [flat|nested] 63+ messages in thread
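The per-file scoping discussed above, spelled out as a complete (illustrative) GNAT project file; only the one named unit gets the wide-literal switches, the rest of the project keeps the defaults:

   project Demo is
      package Compiler is
         for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8");
      end Compiler;
   end Demo;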
* Re: Ada and Unicode 2021-04-19 16:14 ` DrPi @ 2021-04-19 17:12 ` Björn Lundin 2021-04-19 19:44 ` DrPi 0 siblings, 1 reply; 63+ messages in thread From: Björn Lundin @ 2021-04-19 17:12 UTC (permalink / raw) Den 2021-04-19 kl. 18:14, skrev DrPi: >> for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8"); >> > That's interesting. > Using these switches at project level is not OK. Project source files > not always use the same encoding. Especially when using libraries. > Using these switches at source level is better. A little bit complicated > to use but better. You did understand that the above setting only applies to the file called 'non_ascii.ads' - and not to the rest of the files? -- Björn ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 17:12 ` Björn Lundin @ 2021-04-19 19:44 ` DrPi 0 siblings, 0 replies; 63+ messages in thread From: DrPi @ 2021-04-19 19:44 UTC (permalink / raw) Le 19/04/2021 à 19:12, Björn Lundin a écrit : > Den 2021-04-19 kl. 18:14, skrev DrPi: >>> for Switches ("non_ascii.ads") use ("-gnatiw", "-gnatW8"); >>> >> That's interesting. >> Using these switches at project level is not OK. Project source files >> not always use the same encoding. Especially when using libraries. >> Using these switches at source level is better. A little bit >> complicated to use but better. > > You did understand that the above setting only applies to the file > called 'non_ascii.ads' - and not to the rest of the files? > > > Yes, that's what I've understood. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 9:08 ` Stephen Leake ` (2 preceding siblings ...) 2021-04-19 16:14 ` DrPi @ 2022-04-16 2:32 ` Thomas 3 siblings, 0 replies; 63+ messages in thread From: Thomas @ 2022-04-16 2:32 UTC (permalink / raw) In article <86mttuk5f0.fsf@stephe-leake.org>, Stephen Leake <stephen_leake@stephe-leake.org> wrote: > DrPi <314@drpi.fr> writes: > > > Any way to use source code encoded in UTF-8 ? > from the gnat user guide, 4.3.1 Alphabetical List of All Switches: > > `-gnati`c'' > Identifier character set (`c' = 1/2/3/4/8/9/p/f/n/w). For details > of the possible selections for `c', see *note Character Set > Control: 4e. > > This applies to identifiers in the source code > > `-gnatW`e'' > Wide character encoding method (`e'=n/h/u/s/e/8). > > This applies to string and character literals. afaik, -gnati is deactivated when -gnatW is not n or h (from memory), so you can't ask both to check that identifiers are in ASCII and to have literals in UTF-8. (if it's resolved in newer versions, that's good news :-) ) -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-17 22:03 Ada and Unicode DrPi ` (2 preceding siblings ...) 2021-04-19 9:08 ` Stephen Leake @ 2021-04-19 13:18 ` Vadim Godunko 2022-04-03 16:51 ` Thomas 2021-04-19 22:40 ` Shark8 4 siblings, 1 reply; 63+ messages in thread From: Vadim Godunko @ 2021-04-19 13:18 UTC (permalink / raw) On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote: > > I have a good knowledge of Unicode : code points, encoding... > What I don't understand is how to manage Unicode strings with Ada. I've > read part of ARM and did some tests without success. > > I managed to be partly successful with source code encoded in Latin-1. > Any other encoding failed. > Any way to use source code encoded in UTF-8 ? > In some languages, it is possible to set a tag at the beginning of the > source file to direct the compiler which encoding to use. > I wasn't successful using -gnatW8 switch. But maybe I made too many tests > and my brain was scrambled. > > Even with source code encoded in Latin-1, I've not been able to manage > Unicode strings correctly. > > What's the way to manage Unicode correctly ? > Ada doesn't have good Unicode support. :( So, you need to find a suitable set of "workarounds". There are a few different aspects of Unicode support that need to be considered: 1. Representation of string literals. If you want to use non-ASCII characters in source code, you need to use the -gnatW8 switch and it will require the use of Wide_Wide_String everywhere. 2. Internal representation during application execution. You are forced to use Wide_Wide_String at the previous step, so it will be UCS4/UTF32. 3. Text encoding/decoding on input/output operations. GNAT allows the use of UTF-8 by providing a magic string for the Form parameter of Text_IO. It is hard to say that this is a reasonable set of features for the modern world. To fix some of the drawbacks of the current situation we are developing a new text processing library, known as VSS. https://github.com/AdaCore/VSS At the current stage it provides an encoding-independent API for text manipulation, encoder and decoder APIs for I/O, and a JSON reader/writer; regexp support should come soon. An encoding-independent API means that the application always uses Unicode characters to process text, independently of the real encoding used to store information in memory (UTF-8 is used for now, UTF-16 will be added later for interoperability with the Windows API and WASM). Decoders and encoders allow translation from/to different encodings when the application exchanges information with the world. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 13:18 ` Vadim Godunko @ 2022-04-03 16:51 ` Thomas 2023-04-04 0:02 ` Thomas 0 siblings, 1 reply; 63+ messages in thread From: Thomas @ 2022-04-03 16:51 UTC (permalink / raw) In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>, Vadim Godunko <vgodunko@gmail.com> wrote: > On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote: > > What's the way to manage Unicode correctly ? > > > > Ada doesn't have good Unicode support. :( So, you need to find a suitable set > of "workarounds". > > There are a few different aspects of Unicode support that need to be considered: > > 1. Representation of string literals. If you want to use non-ASCII characters > in source code, you need to use the -gnatW8 switch and it will require the use of > Wide_Wide_String everywhere. > 2. Internal representation during application execution. You are forced to > use Wide_Wide_String at the previous step, so it will be UCS4/UTF32. > > It is hard to say that this is a reasonable set of features for the modern world. I don't think Ada is lacking that much for good UTF-8 support. the cardinal point is to be able to fill an Ada.Strings.UTF_Encoding.UTF_8_String with a literal. (once you have that, trying to fill a Standard.String with a non-Latin-1 character will give an error, and i think that's fine :-) ) does Ada 202x allow it? if not, it would probably be easier if it were type UTF_8_String is new String; instead of subtype UTF_8_String is String; for all subprograms it's quite easy: we just have to duplicate them with the new type, and mark the old ones as Obsolescent. but, now that "subtype UTF_8_String" exists, i don't know what we can do for types. (is the only way to choose a new name?) > To > fix some of the drawbacks of the current situation we are developing a new text > processing library, known as VSS. > > https://github.com/AdaCore/VSS (are you working at AdaCore?) -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
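A minimal sketch of that derived-type idea (a hypothetical package, not the standard Ada.Strings.UTF_Encoding; pragma Obsolescent is GNAT-specific):

   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   package Example_UTF_Encoding is
      type UTF_8_String is new String;  -- a distinct type, not a subtype

      function Encode (Item : Wide_Wide_String) return UTF_8_String;

      --  S : Standard.String := Encode ("x");  -- now rejected at compile time
      --  U : UTF_8_String    := Encode ("x");  -- OK
      --  The old subtype-based declarations would stay, marked with
      --  pragma Obsolescent, as suggested above.
   end Example_UTF_Encoding;

   package body Example_UTF_Encoding is
      function Encode (Item : Wide_Wide_String) return UTF_8_String is
         Octets : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
           Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Item);
      begin
         return UTF_8_String (Octets);
      end Encode;
   end Example_UTF_Encoding;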
* Re: Ada and Unicode 2022-04-03 16:51 ` Thomas @ 2023-04-04 0:02 ` Thomas 0 siblings, 0 replies; 63+ messages in thread From: Thomas @ 2023-04-04 0:02 UTC (permalink / raw) In article <fantome.forums.tDeContes-079FD6.18515603042022@news.free.fr>, Thomas <fantome.forums.tDeContes@free.fr.invalid> wrote: > In article <f9d91cb0-c9bb-4d42-a1a9-0cd546da436cn@googlegroups.com>, > Vadim Godunko <vgodunko@gmail.com> wrote: > > > On Sunday, April 18, 2021 at 1:03:14 AM UTC+3, DrPi wrote: > > > > What's the way to manage Unicode correctly ? > > Ada doesn't have good Unicode support. :( So, you need to find a suitable set > > of "workarounds". > > > > There are a few different aspects of Unicode support that need to be considered: > > > > 1. Representation of string literals. If you want to use non-ASCII > > characters > > in source code, you need to use the -gnatW8 switch and it will require the use of > > Wide_Wide_String everywhere. > > 2. Internal representation during application execution. You are forced to > > use Wide_Wide_String at the previous step, so it will be UCS4/UTF32. > > > It is hard to say that this is a reasonable set of features for the modern world. > > I don't think Ada is lacking that much for good UTF-8 > support. > > the cardinal point is to be able to fill an > Ada.Strings.UTF_Encoding.UTF_8_String with a literal. > (once you have that, trying to fill a Standard.String with a > non-Latin-1 character will give an error, and i think that's fine :-) ) > > does Ada 202x allow it? hi ! I think I found a quite nice solution! (reading <t3lj44$fh5$1@dont-email.me> again) (not tested yet) it's not perfect according to the rules of the art, but it is: - Ada 2012 compatible - better than writing UTF-8 Ada code and then telling gnat it is Latin-1 (that way it would take UTF_8_String for what it is: an array of octets, but it would not detect an invalid UTF-8 string, and if someone says it's really UTF-8 everything goes wrong) - better than being limited to ASCII in string literals - never needs to explicitly declare Wide_Wide_String: it's always implicit, for a very short time, and AFAIK eligible for optimization

   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   package UTF_Encoding is
      subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;
      --  an expression function rather than "renames Encode": Encode has
      --  a second, defaulted Output_BOM parameter, so a one-parameter
      --  renaming would not be conformant
      function "+" (A : in Wide_Wide_String) return UTF_8_String is
        (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (A));
   end UTF_Encoding;

then we can do:

   with UTF_Encoding;

   package User is
      use UTF_Encoding;
      My_String : UTF_8_String := + "Greek characters + smileys";
   end User;

if you want to avoid "use UTF_Encoding;", i think "use type UTF_Encoding.UTF_8_String;" doesn't work, but this should work:

   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   package UTF_Encoding is
      subtype UTF_8_String is Ada.Strings.UTF_Encoding.UTF_8_String;
      type Literals_For_UTF_8_String is new Wide_Wide_String;
      function "+" (A : in Literals_For_UTF_8_String) return UTF_8_String is
        (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode
           (Wide_Wide_String (A)));
   end UTF_Encoding;

   with UTF_Encoding;

   package User is
      use type UTF_Encoding.Literals_For_UTF_8_String;
      My_String : UTF_Encoding.UTF_8_String := + "Greek characters + smileys";
   end User;

what do you think about that? good idea or not? :-) -- RAPID maintainer http://savannah.nongnu.org/projects/rapid/ ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-17 22:03 Ada and Unicode DrPi ` (3 preceding siblings ...) 2021-04-19 13:18 ` Vadim Godunko @ 2021-04-19 22:40 ` Shark8 2021-04-20 15:05 ` Simon Wright 4 siblings, 1 reply; 63+ messages in thread From: Shark8 @ 2021-04-19 22:40 UTC (permalink / raw) On Saturday, April 17, 2021 at 4:03:14 PM UTC-6, DrPi wrote: > Hi, > > I have a good knowledge of Unicode : code points, encoding... > What I don't understand is how to manage Unicode strings with Ada. I've > read part of ARM and did some tests without success. > > I managed to be partly successful with source code encoded in Latin-1. Ah. Yes, this is an issue in GNAT, and possibly other compilers. The easiest method for me is to right-click the text-buffer for the file in GPS, click properties in the menu that pops up, then in the dialog select from the Character Set drop-down "Unicode UTF-#". > Any other encoding failed. > Any way to use source code encoded in UTF-8 ? There's the above method with GPS. IIRC there's also a Pragma and a compiler-flag for GNAT. It's actually a non-issue for Byron, because the file-reader does a BOM-check [IIRC defaulting to ASCII in the absence of a BOM] and outputs to the lexer the Wide_Wide_Character equivalent of the input-encoding. See: https://github.com/OneWingedShark/Byron/blob/master/src/reader/readington.adb > In some languages, it is possible to set a tag at the beginning of the > source file to direct the compiler which encoding to use. > I wasn't successful using -gnatW8 switch. But maybe I made too many tests > and my brain was scrambled. IIRC the gnatW8 flag sets it to UTF-8, so if your editor is saving in something else like UTF-16 BE, the compiler [probably] won't read it correctly. > Even with source code encoded in Latin-1, I've not been able to manage > Unicode strings correctly. > > What's the way to manage Unicode correctly ? I typically use the GPS file/properties method above, and then I might also use the pragma. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-19 22:40 ` Shark8 @ 2021-04-20 15:05 ` Simon Wright 2021-04-20 19:17 ` Randy Brukardt 0 siblings, 1 reply; 63+ messages in thread From: Simon Wright @ 2021-04-20 15:05 UTC (permalink / raw) Shark8 <onewingedshark@gmail.com> writes: > It's actually a non-issue for Byron, because the file-reader does a > BOM-check [IIRC defaulting to ASCII in the absence of a BOM] GNAT does a BOM-check also. gnatchop does one better, carrying the BOM from the top of the input file through to each output file. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-20 15:05 ` Simon Wright @ 2021-04-20 19:17 ` Randy Brukardt 2021-04-20 20:04 ` Simon Wright 0 siblings, 1 reply; 63+ messages in thread From: Randy Brukardt @ 2021-04-20 19:17 UTC (permalink / raw) "Simon Wright" <simon@pushface.org> wrote in message news:lybla9574t.fsf@pushface.org... > Shark8 <onewingedshark@gmail.com> writes: > >> It's actually a non-issue for Byron, because the file-reader does a >> BOM-check [IIRC defaulting to ASCII in the absence of a BOM] > > GNAT does a BOM-check also. gnatchop does one better, carrying the BOM > from the top of the input file through to each output file. That's what the documentation says, but it didn't work on ACATS source files (the few which use Unicode start with a BOM). I had to write a bunch of extra code in the script generator to stick the options on the Unicode files (that worked). Perhaps that's been fixed since, but I wouldn't trust it (burned once, twice shy). Randy. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: Ada and Unicode 2021-04-20 19:17 ` Randy Brukardt @ 2021-04-20 20:04 ` Simon Wright 0 siblings, 0 replies; 63+ messages in thread From: Simon Wright @ 2021-04-20 20:04 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > "Simon Wright" <simon@pushface.org> wrote in message > news:lybla9574t.fsf@pushface.org... >> Shark8 <onewingedshark@gmail.com> writes: >> >>> It's actually a non-issue for Byron, because the file-reader does a >>> BOM-check [IIRC defaulting to ASCII in the absence of a BOM] >> >> GNAT does a BOM-check also. gnatchop does one better, carrying the BOM >> from the top of the input file through to each output file. > > That's what the documentation says, but it didn't work on ACATS source files > (the few which use Unicode start with a BOM). I had to write a bunch of > extra code in the script generator to stick the options on the Unicode files > (that worked). Perhaps that's been fixed since, but I wouldn't trust it > (burned once, twice shy). It does now: just checked again with c250001, c250002. ^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2023-04-04 0:02 UTC | newest] Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-17 22:03 Ada and Unicode DrPi 2021-04-18 0:02 ` Luke A. Guest 2021-04-19 9:09 ` DrPi 2021-04-19 8:29 ` Maxim Reznik 2021-04-19 9:28 ` DrPi 2021-04-19 13:50 ` Maxim Reznik 2021-04-19 15:51 ` DrPi 2021-04-19 11:15 ` Simon Wright 2021-04-19 11:50 ` Luke A. Guest 2021-04-19 15:53 ` DrPi 2022-04-03 19:20 ` Thomas 2022-04-04 6:10 ` Vadim Godunko 2022-04-04 14:19 ` Simon Wright 2022-04-04 15:11 ` Simon Wright 2022-04-05 7:59 ` Vadim Godunko 2022-04-08 9:01 ` Simon Wright 2023-03-30 23:35 ` Thomas 2022-04-04 14:33 ` Simon Wright 2021-04-19 9:08 ` Stephen Leake 2021-04-19 9:34 ` Dmitry A. Kazakov 2021-04-19 11:56 ` Luke A. Guest 2021-04-19 12:13 ` Luke A. Guest 2021-04-19 15:48 ` DrPi 2021-04-19 12:52 ` Dmitry A. Kazakov 2021-04-19 13:00 ` Luke A. Guest 2021-04-19 13:10 ` Dmitry A. Kazakov 2021-04-19 13:15 ` Luke A. Guest 2021-04-19 13:31 ` Dmitry A. Kazakov 2022-04-03 17:24 ` Thomas 2021-04-19 13:24 ` J-P. Rosen 2021-04-20 19:13 ` Randy Brukardt 2022-04-03 18:04 ` Thomas 2022-04-06 18:57 ` J-P. Rosen 2022-04-07 1:30 ` Randy Brukardt 2022-04-08 8:56 ` Simon Wright 2022-04-08 9:26 ` Dmitry A. Kazakov 2022-04-08 19:19 ` Simon Wright 2022-04-08 19:45 ` Dmitry A. Kazakov 2022-04-09 4:05 ` Randy Brukardt 2022-04-09 7:43 ` Simon Wright 2022-04-09 10:27 ` DrPi 2022-04-09 16:46 ` Dennis Lee Bieber 2022-04-09 18:59 ` DrPi 2022-04-10 5:58 ` Vadim Godunko 2022-04-10 18:59 ` DrPi 2022-04-12 6:13 ` Randy Brukardt 2021-04-19 16:07 ` DrPi 2021-04-20 19:06 ` Randy Brukardt 2022-04-03 18:37 ` Thomas 2022-04-04 23:52 ` Randy Brukardt 2023-03-31 3:06 ` Thomas 2023-04-01 10:18 ` Randy Brukardt 2021-04-19 16:14 ` DrPi 2021-04-19 17:12 ` Björn Lundin 2021-04-19 19:44 ` DrPi 2022-04-16 2:32 ` Thomas 2021-04-19 13:18 ` Vadim Godunko 2022-04-03 16:51 ` Thomas 2023-04-04 0:02 ` Thomas 2021-04-19 22:40 ` Shark8 2021-04-20 15:05 ` Simon Wright 2021-04-20 19:17 ` Randy Brukardt 2021-04-20 20:04 ` Simon Wright