* strange behaviour of utf-8 files
From: Stoik @ 2013-11-16 13:12 UTC

I am using GPS 5.2.1 with UTF-8 encoding in the editor. I tried to write a
simple routine to strip the diacritical marks from Polish texts. When
executing a test program, I got the "Translation_Error" message, and it
turned out that a string consisting of Polish letters was treated as having
double the proper length. You can try it for yourself: with

    S : String := "ó";

we get S'Length = 2. Where is the catch? Is it a compiler error, a GPS
error, or my own?

Stoik

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-16 13:34 UTC

On Sat, 16 Nov 2013 05:12:29 -0800 (PST), Stoik wrote:

> I am using gps 5.2.1 with utf-8 encoding in the editor. I tried to write a
> simple routine to strip the diacritical marks from Polish texts. When
> executing a test program, I got the "translation_error" message, and it
> turned out that the string consisting of Polish letters was treated as
> double the proper length. You can try for yourself: with
> s: string := "ó";
> we get s'length=2. Where is the hook? Is it a compiler error, gps error,
> or my own one?

Without source code it is impossible to say. But "ó" in UTF-8 is two
octets: 16#C3# 16#B3#. When packed into a String, that must be 2 characters
long, considering octet = Character (which formally it is not, but
whatever).

P.S. I would not use Latin-1 or anything beyond 7-bit ASCII in the source
code, in order to keep it portable across different systems.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
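
The two-octet claim above can be checked without putting any non-ASCII
character into the source file. What follows is only a minimal sketch,
assuming an Ada 2012 compiler that provides Ada.Strings.UTF_Encoding; the
procedure name Show_Octets is made up for illustration.

```ada
with Ada.Text_IO;                           use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

--  Hypothetical demo procedure; the source text is pure ASCII on purpose.
procedure Show_Octets is
   --  U+00F3 LATIN SMALL LETTER O WITH ACUTE as one Wide_Character
   O_Acute : constant Wide_String := (1 => Wide_Character'Val (16#F3#));

   --  UTF-8 encoding of the same text, stored one octet per Character
   Encoded : constant UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Encode (O_Acute);
begin
   Put_Line ("characters  :" & Integer'Image (O_Acute'Length));
   Put_Line ("UTF-8 octets:" & Integer'Image (Encoded'Length));

   --  List the octets; for U+00F3 they are 195 and 179 (16#C3#, 16#B3#)
   for C of Encoded loop
      Put_Line (Integer'Image (Character'Pos (C)));
   end loop;
end Show_Octets;
```

One character, two octets: a String that holds the UTF-8 form therefore has
'Length 2, which is what the original post observed.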

* Re: strange behaviour of utf-8 files
From: Stoik @ 2013-11-16 15:09 UTC

On Saturday, 16 November 2013 at 14:34:43 UTC+1, Dmitry A. Kazakov wrote:

> On Sat, 16 Nov 2013 05:12:29 -0800 (PST), Stoik wrote:
>
> > I am using gps 5.2.1 with utf-8 encoding in the editor. I tried to
> > write a simple routine to strip the diacritical marks from Polish
> > texts. When executing a test program, I got the "translation_error"
> > message, and it turned out that the string consisting of Polish
> > letters was treated as double the proper length. You can try for
> > yourself: with
> > s: string := "ó";
> > we get s'length=2. Where is the hook? Is it a compiler error, gps
> > error, or my own one?
>
> Without source code it is impossible to say. But "ó" in UTF-8 is two
> octets: 16#C3# 16#B3#. When packed into a string that must be 2
> characters long, considering octet=Character (which formally is not,
> but whatever).
>
> P.S. I would not use Latin-1 or anything beyond 7-bit ASCII in the
> source code in order to make it portable across different systems.

Thanks for the answer. Your advice is certainly sound, but not very
satisfactory. The whole purpose of UTF-8 is to make things portable across
platforms. If the compiler cannot deal properly with source code written in
the UTF-8 encoding, then the whole effort that went into all the Wide_ and
Wide_Wide_ packages, and the new packages that deal with various encodings,
is lost (all the Latin-x possibilities are useless anyway, at least on the
Windows platform).

I am attaching a trivial program which behaves differently according to the
encoding (UTF-8 or ISO-8859-1) of the source code, printing 2 or 1
respectively as the answer.

    with Ada.Text_IO; use Ada.Text_IO;

    procedure Example is
       S : String := "ó";
    begin
       --  'Img is a GNAT-specific attribute
       Put_Line (S'Length'Img);
    end Example;

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-16 15:55 UTC

On Sat, 16 Nov 2013 07:09:48 -0800 (PST), Stoik wrote:

> If the compiler cannot deal properly with the source code
> written in the utf-8 encoding,

The compiler can. I believe there are GCC switches which, together with the
locale, control that.

> then the whole effort that went into all
> the wide_ and wide_wide_ packages and the new packages that deal with
> various encodings is lost (all the Latin-x possibilities are useless
> anyway, at least on Windows platform).

Not at all. Ada's String directly corresponds to the A-functions of the
Windows API. The Windows W-functions are UTF-16.

And the issue has nothing to do with the language. It is about using one
encoding in the editor and another with the compiler.

> with ada.text_io; use ada.text_io;
> procedure example is
>    S : String := "ó";
> begin
>    Put_Line (S'Length'Img);
> end;

As I said, in order to avoid trouble, don't use anything but ASCII. Do this:

    SMALL_LETTER_O_WITH_ACUTE_UTF8 : constant String :=
       Character'Val (16#C3#) & Character'Val (16#B3#);

    SMALL_LETTER_O_WITH_ACUTE_Latin1 : constant String :=
       (1 => Character'Val (16#F3#));

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
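
The editor/compiler mismatch described above would also account for the
Translation_Error in the original post, if (as an assumption, since the
original routine was never shown) it built its mapping with
Ada.Strings.Maps.To_Mapping. That function propagates Translation_Error when
From and To differ in length or when From repeats a character (RM A.4.2). A
minimal sketch, with the misread literal spelled out as octets so that the
source stays ASCII; Mapping_Demo and Misread are illustrative names:

```ada
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Strings;
with Ada.Strings.Maps; use Ada.Strings.Maps;

--  Hypothetical reconstruction, not the original poster's code.
procedure Mapping_Demo is
   --  The two letters U+0105 and U+0107 as UTF-8 (C4 85, C4 87): this is
   --  what a String literal holds when UTF-8 source is read as Latin-1.
   Misread : constant String :=
     Character'Val (16#C4#) & Character'Val (16#85#) &
     Character'Val (16#C4#) & Character'Val (16#87#);
begin
   Put_Line ("From'Length =" & Integer'Image (Misread'Length));

   declare
      --  From has 4 elements (16#C4# occurs twice), To has only 2,
      --  so To_Mapping propagates Translation_Error.
      Strip : constant Character_Mapping :=
        To_Mapping (From => Misread, To => "ac");
   begin
      Put_Line ("mapping built");
   end;
exception
   when Ada.Strings.Translation_Error =>
      Put_Line ("Translation_Error raised");
end Mapping_Demo;
```

The program reports a From length of 4 and then takes the exception branch,
which matches the "double the proper length" symptom.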

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-17 13:32 UTC

On 16.11.13 16:55, Dmitry A. Kazakov wrote:

> As I said in order to avoid troubles, don't use anything but ASCII.

ASCII-ism is the soil in which dangerous bugs keep many things from
working.(*)

With an attitude of denial towards encoding basics, would anyone ever
approach *numbers* in the same way? I doubt it.

The best medication against chronic character FUD is to

(a) see how some unambiguous encoding does work everywhere (e.g. the
    universally supported UTF-16) (**),

(b) understand that single units of text and single octets are not in
    general isomorphic; this leads to bugs just as harmless or harmful as
    erroneous execution in the presence of not 'Valid,

(c) understand that maybe wasting 9 bits of 16 bit characters (or a few
    bits per octet sequence in UTF-8) is not worth mentioning these days,
    considering source text.

Part (b) will not come to be as long as most programmers are fine thinking
that text is always 7bit characters in real life. If, instead, programmers
start learning about further bits---that Character is a type, not an
encoding---integrating software will start working better.

__
(*) A big one of these ASCII bugs yields Google's infrastructure stuck with
Python 2.7.

(**) I understand that even the US Navy has officially started using more
characters than ASCII.

So, can I maintain hopes that GNAT will one day read source files that use
UTF-NN, which GNAT does support?

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-17 14:07 UTC

On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:

> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>> As I said in order to avoid troubles, don't use anything but ASCII.
>
> ASCII-ism is the soil in which dangerous bugs keep many things
> from working.(*)

On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
Windows) incapable of handling encoding safely [*]. The OP just ran into
that. If he had followed the advice, he would never have had any problems of
this kind.

Using full Unicode in source files is a recipe for bugs intractable for many
program readers, like the ones who would not guess that 'a' and 'а' are
different letters.

-------
* Preventing a file encoded as X from being read and written as if it were
encoded as Y.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: Dennis Lee Bieber @ 2013-11-17 17:19 UTC

On Sun, 17 Nov 2013 15:07:18 +0100, "Dmitry A. Kazakov"
<mailbox@dmitry-kazakov.de> declaimed the following:

>On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>
>> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>>> As I said in order to avoid troubles, don't use anything but ASCII.
>>
>> ASCII-ism is the soil in which dangerous bugs keep many things
>> from working.(*)
>
>On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>Windows) incapable to handle encoding safely [*]. The OP just ran into
>that. If he followed the advise he would never have any problems of this
>kind.
>
>Using full Unicode in source files is a recipe for bugs intractable for
>many program readers, like ones who would not guess 'a' and 'а' different
>letters.
>

5-bit BAUDOT should be good enough for any programming! <G>

--
Wulfraed    Dennis Lee Bieber    AF6VN
wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-17 18:07 UTC

On Sun, 17 Nov 2013 12:19:23 -0500, Dennis Lee Bieber wrote:

> On Sun, 17 Nov 2013 15:07:18 +0100, "Dmitry A. Kazakov"
> <mailbox@dmitry-kazakov.de> declaimed the following:
>
>>On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>>
>>> On 16.11.13 16:55, Dmitry A. Kazakov wrote:
>>>> As I said in order to avoid troubles, don't use anything but ASCII.
>>>
>>> ASCII-ism is the soil in which dangerous bugs keep many things
>>> from working.(*)
>>
>>On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>>Windows) incapable to handle encoding safely [*]. The OP just ran into
>>that. If he followed the advise he would never have any problems of this
>>kind.
>>
>>Using full Unicode in source files is a recipe for bugs intractable for
>>many program readers, like ones who would not guess 'a' and 'а' different
>>letters.
>
> 5-bit BAUDOT should be good enough for any programming!

On the other hand, there exists an alphabet in which any eligible program is
just one symbol long. Both Unicode and ASCII are in between.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-17 19:05 UTC

On 17.11.13 15:07, Dmitry A. Kazakov wrote:

>> ASCII-ism is the soil in which dangerous bugs keep many things
>> from working.(*)
>
> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
> Windows) incapable to handle encoding safely [*]. The OP just ran into
> that. If he followed the advise he would never have any problems of this
> kind.

> -------
> * Preventing a file encoded as X, being read and written as if it were
> encoded as Y.

Precaution? ASCII could just as well be EBCDIC. When the OS's programming
interface does not suggest studying the file type, then the best thing one
can do when reading a text file is to rely on the data---UTF-NN has a BOM,
which is better than nothing, and certainly is better than any 7bit (or
8bit) ambiguities.

It is unfortunate that 7bit engineers can't swallow their pride and use the
extended file attributes available with all semi-modern and modern file
systems and archive formats.

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-17 20:38 UTC

On Sun, 17 Nov 2013 20:05:26 +0100, Georg Bauhaus wrote:

> On 17.11.13 15:07, Dmitry A. Kazakov wrote:
>
>>> ASCII-ism is the soil in which dangerous bugs keep many things
>>> from working.(*)
>>
>> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
>> Windows) incapable to handle encoding safely [*]. The OP just ran into
>> that. If he followed the advise he would never have any problems of this
>> kind.
>
>> -------
>> * Preventing a file encoded as X, being read and written as if it were
>> encoded as Y.
>
> Precaution? ASCII could just as well be EBCDIC.

Firstly, EBCDIC is practically dead. Secondly, you simply cannot compile any
Ada program encoded in EBCDIC as if it were ASCII. No chance.

UTF-8 was intentionally designed to be compatible with ASCII, which is why
there is trouble with Latin-1, which is also an extension of ASCII. The same
would happen if somebody used KOI-8 thinking it were Latin-1 or UTF-8.

The problem is that the common part (ASCII) is sufficient for Ada
programming, while the varying part is subtle enough to cause
difficult-to-detect bugs in string literals. Bugs that cannot be detected by
the compiler.

> It is unfortunate that 7bit engineers can't swallow their pride
> and use extended files attributes available with all semi-modern
> and modern file systems and archive formats.

What for? In order to get the silly bugs the OP got?

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-18 8:38 UTC

On 17.11.13 21:38, Dmitry A. Kazakov wrote:

> The problem is that the common part (ASCII) is sufficient for Ada
> programming while the varying part is subtle enough to cause difficult to
> detect bugs in string literals. Bugs that cannot be detected by the
> compiler.

UTF-8 can actually be so checked (and is checked by typical implementations)
that accidentally mistaking some octets of a string literal for Latin-1
coded characters is impossible: this is a consequence of the design of
UTF-8, as you know: the {1}+0 prefix rules.

Actually, a compiler---GNAT having a helpful spell checker already---could
detect occurrences in string literals of

    String'(N => Character'Val (195), N+1 => Character'Val (179))

as very likely being the valid UTF-8 sequence representing "ó". It will then
emit a warning saying that the source text might be UTF-8 rather than
Latin-1, and suggest a compiler switch accordingly. Of course, the presence
of a BOM can add further support to this warning.
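
The prefix rules mentioned above can be sketched as a small structural check.
This is an illustration only, not what GNAT or any other compiler actually
does: it tests the lead-octet / continuation-octet pattern and deliberately
ignores overlong forms and surrogates. The names UTF8_Check and
Looks_Like_UTF8 are made up.

```ada
with Ada.Text_IO; use Ada.Text_IO;

--  Hypothetical demo procedure; source text is pure ASCII.
procedure UTF8_Check is

   --  True if S follows the UTF-8 lead/continuation octet pattern.
   --  Simplified: overlong encodings and surrogates are not rejected.
   function Looks_Like_UTF8 (S : String) return Boolean is
      I    : Integer := S'First;
      Code : Natural;
      Tail : Natural;   --  number of continuation octets expected
   begin
      while I <= S'Last loop
         Code := Character'Pos (S (I));
         if Code < 16#80# then
            Tail := 0;                      --  0xxxxxxx
         elsif Code in 16#C0# .. 16#DF# then
            Tail := 1;                      --  110xxxxx
         elsif Code in 16#E0# .. 16#EF# then
            Tail := 2;                      --  1110xxxx
         elsif Code in 16#F0# .. 16#F7# then
            Tail := 3;                      --  11110xxx
         else
            return False;                   --  cannot start a sequence
         end if;

         if Tail > S'Last - I then
            return False;                   --  sequence is cut short
         end if;

         for J in I + 1 .. I + Tail loop
            if Character'Pos (S (J)) not in 16#80# .. 16#BF# then
               return False;                --  not a 10xxxxxx octet
            end if;
         end loop;

         I := I + Tail + 1;
      end loop;
      return True;
   end Looks_Like_UTF8;

   --  "o with acute" in UTF-8, and a short Latin-1 word containing it
   UTF8_O      : constant String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);
   Latin1_Word : constant String := "s" & Character'Val (16#F3#) & "l";
begin
   Put_Line (Boolean'Image (Looks_Like_UTF8 (UTF8_O)));       --  TRUE
   Put_Line (Boolean'Image (Looks_Like_UTF8 (Latin1_Word)));  --  FALSE
end UTF8_Check;
```

The check only works in one direction: Latin-1 text with accented letters
rarely passes as UTF-8, whereas every UTF-8 octet sequence is also a legal
Latin-1 string.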

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-18 9:01 UTC

On Mon, 18 Nov 2013 09:38:06 +0100, Georg Bauhaus wrote:

> On 17.11.13 21:38, Dmitry A. Kazakov wrote:
>> The problem is that the common part (ASCII) is sufficient for Ada
>> programming while the varying part is subtle enough to cause difficult to
>> detect bugs in string literals. Bugs that cannot be detected by the
>> compiler.
>
> UTF-8 can actually be so checked (and is checked by typical implementations)

1. The share of illegal UTF-8 sequences is negligible. The one among Ada
programs is even less than that.

2. Latin-1 sequences are all legal.

Now, carefully observe that the program in question was dealt with as if it
were encoded in Latin-1. So much for your theory.

---------------
P.S. In order to make a point you should take a set of legal [and practical]
Ada programs encoded in X and then reinterpreted in Y. Then you compare how
many of them:

1. become illegal
2. remain legal, keeping the semantics
3. remain legal, breaking the semantics

The last case is the worst possible scenario, which is what the OP
experienced.

P.P.S. Also important when it comes to keeping the source sane ASCII: Ada
provides a standard package that defines the Latin-1 characters:

    Ada.Characters.Latin_1 (RM A.3.3)

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-18 10:06 UTC

On 18.11.13 10:01, Dmitry A. Kazakov wrote:

>> UTF-8 can actually be so checked (and is checked by typical implementations)
>
> 1. The share of illegal UTF-8 sequences is negligible. The one among Ada
> programs is even less than that.

The share of illegal UTF-8 sequences in source text stays low as long as
policies prevent use of anything but ASCII. But! OTOH, the difficulty of
adapting to use of limited character sets stays high, unnerving, and costly.

(I know because the source text used here and elsewhere is full of
ASCII-sequences representing ubiquitous Unicode characters. These are
characters that users expect to see. If 1234 codes some common international
character, then having to write "abc \x{1234}" all over the place is a PITA.
The need to write "abc ["1234"]" GNAT style does not change that.)

> 2. Latin1 sequences are all legal.

Legality of (only) almost all octets interpreted as Latin-1 characters does
not make the interpretation of string literals correct. Correctness involves
the problem specification, not just Ada. Which is what matters most: The
*user*, the raison d'être of programming, is not really satisfied when legal
programs will actually malfunction because of legal ambiguity of legal
octets. Would anyone be at ease with similar ambiguity of number literals?

> Now, carefully observe that the program in question was dealt with as if it
> were encoded in Latin1. So much for your theory.

My theory involves programmers, foreign software, and users, in addition to
the mere formalism that you mention.

> ---------------
> P.S. In order to make a point you should take a set of legal [and
> practical] Ada programs encoded in X and then reinterpreted in Y. Then you
> compare how many of them become:

0. useful

> 1. illegal
> 2. remain legal keeping the semantics
> 3. remain legal breaking the semantics

Note that legality can always be established together with 0, and
automatically is, once programmers can easily specify character encoding to
be something unambiguous.

The stubbornness of 7bit engineering in OSs and in other circumstances calls
for a

    pragma Source_Text_Encoding (...);

With this warning sign in place, both old and new generations of programmers
can do their job.

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-18 8:44 UTC

On 17.11.13 21:38, Dmitry A. Kazakov wrote:

>> It is unfortunate that 7bit engineers can't swallow their pride
>> and use extended files attributes available with all semi-modern
>> and modern file systems and archive formats.

> What for? In oder to get silly bugs the OP did?

In order to be able to integrate software (libraries, sources) that use
international characters.

Also, indirectly, in order to help programmers get acquainted with the
effects of an 8th bit, and with encodings in general.

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-18 10:24 UTC

On Mon, 18 Nov 2013 09:44:05 +0100, Georg Bauhaus wrote:

> On 17.11.13 21:38, Dmitry A. Kazakov wrote:
>>> It is unfortunate that 7bit engineers can't swallow their pride
>>> and use extended files attributes available with all semi-modern
>>> and modern file systems and archive formats.
>
>> What for? In oder to get silly bugs the OP did?
>
> In order to be able to integrate software (libraries, sources) that
> use international characters.

Why can it not be integrated without these bugs?

It is like saying that Ada programs must use System.Address in order to be
integrated with machine code. We don't want that kind of integration. That
is why we are using Ada instead of Assembler.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: G.B. @ 2013-11-18 13:05 UTC

On 18.11.13 11:24, Dmitry A. Kazakov wrote:

>> In order to be able to integrate software (libraries, sources) that
>> use international characters.
>
> Why cannot it be integrated without these bugs?

Character literals are not bugs. Ada lacks means of expressing the
programmer's intent here, that much is true. Encoding could be specified by
an aspect, just like 'Size. The language is buggy here, when matched against
ubiquitous real world programming situations.

> It is like saying that Ada programs must use System.Address in order to be
> integrated with machine code.

Yes. Machine code without machine addresses would be magic.

> We don't want such kind of integration.

We can't say: we don't want character literals, or string literals. People
use international character literals. Compiling programs that use
international characters as per the Ada LRM should work without much ado and
without all the FUD-induced avoidance, and without compiler difficulties.

To suggest only using ASCII is rather like suggesting not to use FPT
(floating-point types), arguing that using FPT leads to results that can
differ when switching from Intel to ARM or to PowerPC.

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-18 15:25 UTC

On Mon, 18 Nov 2013 14:05:45 +0100, G.B. wrote:

> Character literals are not bugs. Ada lacks means of
> expressing programmer's intent here, that much is true.
> Encoding could be specified by an aspect, just like 'Size.
> The language is buggy here, when matched against ubiquitous
> real world programming situations.

You are fundamentally wrong here. Encoding is not an aspect, encoding is a
type. Compare:

    123 is a literal of Integer, mod 341, Unsigned_16, ... types

    "A" is a literal of String (Latin1), Wide_String (UCS-2),
        Wide_Wide_String (UCS-4)

Ada can and surely must have UTF-8 and whatever other encoded strings,
characters and slices. The reason why this is not done lies in other
language problems that are irrelevant here. [It would cause a combinatorial
explosion of the standard libraries.]

Note that, with all and any thinkable additions, the problem the OP had will
still be present, because it has nothing to do with the language itself.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: G.B. @ 2013-11-18 15:51 UTC

On 18.11.13 16:25, Dmitry A. Kazakov wrote:

> Compare:
>
> 123 is a literal of Integer, mod 341, Unsigned_16, ... types

Compare

    type My_Int is range 1 .. 10
       with Size => 16;

to

    type My_Int is range 1 .. 10
       with Size => 32;

Now

    type My_Char is range 'A' .. 'Z'
       with size => 31;

to

    type My_Char is range 'A' .. 'Z'
       with size => 15;

And

    type My_Float is digits 6 range 0.0 .. 1_000.0
       with Mode => Round_To_Nearest_Even;

These are representation issues. They direct the compiler to choose (a) a
number of bits and (b) a set of operations. "+" will be affected by 'Size,
though not at the level of abstract operations. I guess there is a view that
says rounding is a type?

So what: An encoding aspect would just be a means that allows programmers to
say what they mean. It may not be as useful as -gnatW*, it may even be
confusing, but it may finally kick the * of compiler makers and project
leads, and make them sort out this encoding nonsense once and forever.
Without rhetoric. And *then* they can say that specifying aspects is
fundamentally wrong.

After all, this is about interpreting a bit pattern at compile time, and
supplying information about the bits explicitly should help. It should help
fixing C's underlying char* issues, too.

Even when using 7bit ASCII, there is *no* information in the source that
explicitly states what is meant in a given ASCII String literal. So, saying
that ASCII works is just accidentally right. A good argument in a C camp.

* Re: strange behaviour of utf-8 files
From: Dmitry A. Kazakov @ 2013-11-18 17:34 UTC

On Mon, 18 Nov 2013 16:51:50 +0100, G.B. wrote:

> On 18.11.13 16:25, Dmitry A. Kazakov wrote:
>> Compare:
>>
>> 123 is a literal of Integer, mod 341, Unsigned_16, ... types
>
> Compare
>
>     type My_Int is range 1 .. 10
>        with Size => 16;
>
> to
>
>     type My_Int is range 1 .. 10
>        with Size => 32;
>
> Now
>
>     type My_Char is range 'A' .. 'Z'
>        with size => 31;
>
> to
>
>     type My_Char is range 'A' .. 'Z'
>        with size => 15;
>
> And
>
>     type My_Float is digits 6 range 0.0 .. 1_000.0
>        with Mode => Round_To_Nearest_Even;
>
> These are representation issues.

Representation is a property of a type. Whether representation is relevant
to the semantics of the type depends on the domain space. The type
determines the semantics and the relevant aspects of the representation. Not
the other way around.

> I guess there is a view that says rounding is a type?

Certainly. Rounding behavior is determined by the semantics of some type
operations.

> So what:
>
> An encoding aspect would just be a means that allows programmers
> to say what they mean.

If the aspect maps to a distinct type, then yes. But it is futile to talk
about aspects as "Ada's RM aspects", because it is unclear whether they are
related to the semantics or are implementation artefacts. Ada 2012 blurred
that beyond recognition.

> After all, this is about interpreting a bit pattern at compile
> time,

No. There are no bit patterns at compile time. The compiler operates on an
alphabet. Ada's alphabet is unambiguous: RM 2.1.

Consider an Ada source packed using LZH. Does that produce another program?
No, the program is the same. You can use punched cards instead, or write it
down on a coaster; it is still the same program.

> Even when using 7bit ASCII, there is *no* information in the
> source that explicitly states what is meant in a given ASCII
> String literal.

Again, it has nothing to do with encoding. Which is why errors here are so
dangerous.

> So, saying that ASCII works is just accidentally
> right.

There is nothing accidental in the case of UTF-8 reinterpreted as Latin-1
and conversely. It works because ASCII is an integral part of either. Thus
an Ada program written in ASCII is invariant under UTF-8, Latin-1 and many
other 8-bit encodings.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: strange behaviour of utf-8 files
From: Stoik @ 2013-11-18 0:34 UTC

On Sunday, 17 November 2013 at 15:07:18 UTC+1, Dmitry A. Kazakov wrote:

> On Sun, 17 Nov 2013 14:32:55 +0100, Georg Bauhaus wrote:
>
> > On 16.11.13 16:55, Dmitry A. Kazakov wrote:
> >> As I said in order to avoid troubles, don't use anything but ASCII.
> >
> > ASCII-ism is the soil in which dangerous bugs keep many things
> > from working.(*)
>
> On the contrary, it is a reasonable precaution against sloppy OSes (Linux,
> Windows) incapable to handle encoding safely [*]. The OP just ran into
> that. If he followed the advise he would never have any problems of this
> kind.

The advice amounts to: do not use cars, they are often badly manufactured
and drivers can be mad. Go on foot, far from busy roads!

I suspect it is much better to press the companies to produce better OSes
and compilers. People do use various languages and/or scripts. And seeing
dozens of packages for handling strange characters, without the possibility
of using those characters in a natural manner, is a bit frustrating.

* Re: strange behaviour of utf-8 files
From: Georg Bauhaus @ 2013-11-16 17:01 UTC

On 16.11.13 16:09, Stoik wrote:

> Thanks for the answer. Your advice is certainly sound, but not very
> satisfactory. The whole purpose of utf-8 is to make things portable
> across platforms. If the compiler cannot deal properly with the source
> code written in the utf-8 encoding, then the whole effort that went into
> all the wide_ and wide_wide_ packages and the new packages that deal with
> various encodings is lost (all the Latin-x possibilities are useless
> anyway, at least on Windows platform). I am adjoining a trivial program
> which works differently according to the encoding (UTF-8 or ISO-8859-1)
> of the source code, printing 1 or 2 as the answer.
>
> with ada.text_io; use ada.text_io;
> procedure example is
>    S : String := "ó";
> begin
>    Put_Line (S'Length'Img);
> end;

GNAT has two switches that affect its way of looking at coded characters in
source text:

for identifiers in source text, specify -gnatiC, where C is one of the
characters listed in section 3.2.10 of the GNAT UG accompanying the
compiler;

for the wide character encoding method, specify -gnatWE, where E is one of
the characters listed in the same document.

With switch -gnatW8, I get

    $ ./example
     1
    $

That is, the source text is understood to be encoded in UTF-8, and 'ó'
becomes Character'Val (243), viz. LC_O_Acute.

* Re: strange behaviour of utf-8 files
From: Stoik @ 2013-11-17 10:38 UTC

On Saturday, 16 November 2013 at 18:01:07 UTC+1, Georg Bauhaus wrote:

> GNAT has two switches that affect its way of looking at
> coded characters in source text:
>
> for identifiers in source text, specify -gnatiC
> where C is one of the characters listed 3.2.10
> of the GNAT UG accompanying the compiler;
>
> for the wide character encoding method, specify -gnatWE
> where E is one of the characters listed in the
> same document.
>
> With switch -gnatW8, I get
>
> $ ./example
> 1
> $
>
> That is, the source text is understood to be encoded
> in UTF-8, and 'ó' becomes Character'Val (243), viz. LC_O_Acute.

Thank you for solving the problem. By mistake, I thanked another author
first.
* Re: strange behaviour of utf-8 files 2013-11-16 13:34 ` Dmitry A. Kazakov 2013-11-16 15:09 ` Stoik @ 2013-11-16 15:12 ` Stoik 2013-11-16 15:57 ` Dmitry A. Kazakov 2013-11-16 20:06 ` Peter C. Chapin 1 sibling, 2 replies; 33+ messages in thread From: Stoik @ 2013-11-16 15:12 UTC (permalink / raw) By the way, nothing changes if I use wide_character and wide_string instead of character and string. Even if character=octet, certainly wide_character is not an octet! ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-16 15:12 ` Stoik @ 2013-11-16 15:57 ` Dmitry A. Kazakov 2013-11-17 11:12 ` Stoik 2013-11-16 20:06 ` Peter C. Chapin 1 sibling, 1 reply; 33+ messages in thread From: Dmitry A. Kazakov @ 2013-11-16 15:57 UTC (permalink / raw)

On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:
> By the way, nothing changes if I use wide_character and wide_string
> instead of character and string. Even if character=octet, certainly
> wide_character is not an octet!

String = Latin1
Wide_String = UCS-2

There is no built-in type for UTF-8, though customarily one uses String for
it (and Wide_String for UTF-16).

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-16 15:57 ` Dmitry A. Kazakov @ 2013-11-17 11:12 ` Stoik 2013-11-22 1:03 ` Randy Brukardt 0 siblings, 1 reply; 33+ messages in thread From: Stoik @ 2013-11-17 11:12 UTC (permalink / raw)

On Saturday, 16 November 2013 16:57:56 UTC+1, Dmitry A. Kazakov wrote:
> On Sat, 16 Nov 2013 07:12:20 -0800 (PST), Stoik wrote:
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> String = Latin1
> Wide_String = UCS-2
>
> There is no built-in type for UTF-8, though customary one uses String for
> it (and Wide_String for UTF-16).
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de

Thanks for your comments. It is obviously a question of having a different
encoding in the editor and the compiler. I forgot to add the -gnatW8 switch to
the compiler (this should be a default, I believe).

Nevertheless, there are still some misunderstandings connected with string,
wide_string and wide_wide_string. They do not correspond to any encodings;
they just correspond to the character repertoires of the encodings you
mentioned: String to the first 256 characters of Unicode (or ISO-10646),
wide_string to the BMP, and wide_wide_string to the whole of Unicode. In
particular, wide_string can be encoded internally using any of utf-8, 16, 32;
the programmer does not need to know anything about it.

I do not believe one should avoid using characters from outside ASCII in the
source code. I tried it in Python and Java with no problems whatsoever. Using
some strange constants instead of the usual glyphs for characters outside
ASCII when using subprograms from ada.(wide_)strings.maps, for example
to_mapping, would be gruesome.

In any case, GNAT is prepared to deal with the problem properly, although the
number of steps the user must remember is a bit too high (setting the
environment variable charset to utf-8, choosing utf-8 in the source editor,
adding -gnatW8 to the compiler switches and -W8 to the pretty printer
switches). And UTF-8 is the only encoding that solves the problem of
non-Latin1 characters at all.

Regards

^ permalink raw reply [flat|nested] 33+ messages in thread
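The point about to_mapping can be made concrete with the task from the first message, stripping diacritical marks from Polish text. This is a minimal sketch under stated assumptions, not code from the thread: the file is saved as UTF-8 and compiled with GNAT's -gnatW8 switch, and the procedure name, the mapping name, and the sample sentence are invented for illustration.

```ada
--  Assumes this file is saved as UTF-8 and compiled with -gnatW8,
--  for example: gnatmake -gnatW8 strip_marks.adb
with Ada.Strings.Wide_Maps;  use Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Fixed;
with Ada.Wide_Text_IO;

procedure Strip_Marks is
   --  From and To must have the same length and From must not repeat
   --  a character; otherwise To_Mapping raises Translation_Error.
   Polish_To_ASCII : constant Wide_Character_Mapping :=
     To_Mapping (From => "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ",
                 To   => "acelnoszzACELNOSZZ");
begin
   --  Translate replaces every character found in From by its
   --  counterpart in To and leaves all other characters unchanged.
   Ada.Wide_Text_IO.Put_Line
     (Ada.Strings.Wide_Fixed.Translate
        ("Zażółć gęślą jaźń", Polish_To_ASCII));
end Strip_Marks;
```

With the literals decoded as UTF-8, both sequences hold 18 characters and the translated sentence is "Zazolc gesla jazn". If the literals were instead read octet by octet, From would no longer have the same length as To, which is one way to arrive at the Translation_Error mentioned at the start of the thread.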
* Re: strange behaviour of utf-8 files 2013-11-17 11:12 ` Stoik @ 2013-11-22 1:03 ` Randy Brukardt 2013-11-22 3:02 ` Shark8 0 siblings, 1 reply; 33+ messages in thread From: Randy Brukardt @ 2013-11-22 1:03 UTC (permalink / raw) "Stoik" <staszek.goldstein@gmail.com> wrote in message news:7464679c-6b98-4e23-a337-83b671473553@googlegroups.com... > Thanks for your comments. It is obviously a question of having a different > encoding in the > editor and the compiler. I forgot to add the -gnatW8 switch to the > compiler (this should be > a default, I believe). Ada 2012 requires compilers to accept UTF-8 source code. But given that Ada source code historically is Latin-1, it's very unlikely that compilers would change the default setting. The effect would be to break the compilation of much existing source, a step that most compiler vendors would never take. Speaking as a vendor, Janus/Ada has a number of default switches that would never be the default choices today. But changing the defaults breaks *everyone's* build scripts; it's just so disruptive that it's not something that we would do unless there was no other choice. It makes command line use of compilers with an extensive history harder than we would like, but that's the price of having customers that go way back. If UTF-8 files were somehow identified as such, we could have friendlier defaults -- but since the use of the BOM is optional (and discouraged in recent Unicode standards), and there are no encoding attributes in common file systems (Windows, Linux) -- there really isn't much that we can do. This is going to remain a mess for a long time to come, I fear. Randy. P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no support for any other encoding (of course it supports Wide_String at runtime). That will have to change as we migrate to Ada 2012, but it probably will be a while before that happens (not a lot of demand). ^ permalink raw reply [flat|nested] 33+ messages in thread
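For readers who have not met the BOM: it is a fixed three-octet prefix that a UTF-8 file may or may not carry, which is why it cannot serve as a reliable marker. The sketch below only illustrates the check itself, using the BOM_8 constant from Ada.Strings.UTF_Encoding (Ada 2012). The helper Starts_With_BOM and the sample strings are hypothetical, and nothing here describes how any particular compiler detects encodings.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Text_IO;              use Ada.Text_IO;

procedure BOM_Demo is

   --  Hypothetical helper: True if Text begins with the UTF-8
   --  byte order mark (octets 16#EF# 16#BB# 16#BF#).
   function Starts_With_BOM (Text : String) return Boolean is
     (Text'Length >= BOM_8'Length
      and then Text (Text'First .. Text'First + BOM_8'Length - 1) = BOM_8);

   --  Stand-ins for the first octets of two source files.
   Marked   : constant String := BOM_8 & "procedure P is";
   Unmarked : constant String := "procedure P is";
begin
   Put_Line (Boolean'Image (Starts_With_BOM (Marked)));    --  TRUE
   Put_Line (Boolean'Image (Starts_With_BOM (Unmarked)));  --  FALSE
end BOM_Demo;
```

The second case is the problematic one: an unmarked file may be plain ASCII, Latin-1, or UTF-8, and the absence of the prefix says nothing about which.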
* Re: strange behaviour of utf-8 files 2013-11-22 1:03 ` Randy Brukardt @ 2013-11-22 3:02 ` Shark8 2013-11-22 11:54 ` Georg Bauhaus 2013-11-23 4:14 ` Randy Brukardt 0 siblings, 2 replies; 33+ messages in thread From: Shark8 @ 2013-11-22 3:02 UTC (permalink / raw) On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote: > > P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no > support for any other encoding (of course it supports Wide_String at > runtime). That will have to change as we migrate to Ada 2012, but it > probably will be a while before that happens (not a lot of demand). Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]? (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-22 3:02 ` Shark8 @ 2013-11-22 11:54 ` Georg Bauhaus 2013-11-23 4:14 ` Randy Brukardt 1 sibling, 0 replies; 33+ messages in thread From: Georg Bauhaus @ 2013-11-22 11:54 UTC (permalink / raw)

On 22.11.13 04:02, Shark8 wrote:
> On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote:
>>
>> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has no
>> support for any other encoding (of course it supports Wide_String at
>> runtime). That will have to change as we migrate to Ada 2012, but it
>> probably will be a while before that happens (not a lot of demand).
>
> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from the customers]?
>
> (Also, would having a package aspect declaring that the /contents/ are to be read as UTF-8 [or any recognized encoding] be a possible workable solution to this problem? -- then you could have a package of String-constants of the proper encoding.)

For literals, in general, I think that static expression functions
will be valuable. I wonder why these have not yet been defined?

For example, an implementation such as Janus/Ada reads string literals
as Latin-1, and therefore, then, static expression functions could test
properties of the literal. (Length checks being another useful, though
less reliable option.) Then, when read as Latin-1, the literal
String_3'("§ 1") in the Subject parameter of

   Is_UTF_8 (First => 1, Subject => "§ 1")

would form part of a static expression that is checked at compile time.
In a static predicate, say.

package UTF_8_Checks is

   pragma Pure (UTF_8_Checks);

   -- (Not working statically, in current Ada.)
   -- If:
   --  - static functions include expression functions of only
   --    static expressions,
   --
   -- then function Is_UTF_8 below can test a string literal
   -- at compile time.

   U0 : constant := 0;
   U1 : constant := 2#1000_0000#;
   U2 : constant := 2#1100_0000#;
   U3 : constant := 2#1110_0000#;
   U4 : constant := 2#1111_0000#;
   U5 : constant := 2#1111_1000#;
   UX : constant := 255;

   subtype XString is String (1 .. 12)
     with Static_Predicate => XString'Last < Positive'Last;
   -- for string_literals of a static string subtype

   type XInteger is range 0 .. 255;

   function Is_UTF_8_Follow (C : Character) return Boolean is
     -- an octet that has its most significant bit set, but
     -- not the next one:
     (Character'Pos (C) in U1 .. U2 - 1);

   function Is_UTF_8 (First : Positive; Subject : XString) return Boolean is
     -- every sequence of characters from Subject is a valid UTF-8
     -- sequence, assuming code points up to 16#10_FFFF#.
     (if First > Subject'Last then
         True
      else
         (case XInteger (Character'Pos (Subject (First))) is
             when 0 .. U1 - 1 =>   -- "ASCII 7 bit"
                Is_UTF_8 (First + 1, Subject),
             when U1 .. U2 - 1 =>  -- handled by Is_UTF_8_Follow
                False,
             when U2 .. U3 - 1 =>
                (if First > Subject'Last - 1 then
                    False
                 else
                    (for all j in 1 .. 1 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                    and Is_UTF_8 (First + 2, Subject)),
             when U3 .. U4 - 1 =>
                (if First > Subject'Last - 2 then
                    False
                 else
                    (for all j in 1 .. 2 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                    and Is_UTF_8 (First + 3, Subject)),
             when U4 .. U5 - 1 =>
                (if First > Subject'Last - 3 then
                    False
                 else
                    (for all j in 1 .. 3 =>
                       Is_UTF_8_Follow (Subject (First + j)))
                    and Is_UTF_8 (First + 4, Subject)),
             when U5 .. UX =>
                False));

end UTF_8_Checks;

^ permalink raw reply [flat|nested] 33+ messages in thread
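Since the package above is described as not working statically in current Ada, the same structural test can at least be run at execution time. The following is a minimal sketch under that assumption, not the author's code: it replaces the recursion over the fixed-length XString with a loop over an unconstrained String, and the names Check_UTF_8, Is_Follow and Follow_Count are invented here. Like the original, it checks only lead and continuation octets and does not reject overlong forms or surrogates.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Check_UTF_8 is

   --  Continuation octet: bit pattern 2#10xx_xxxx#.
   function Is_Follow (C : Character) return Boolean is
     (Character'Pos (C) in 16#80# .. 16#BF#);

   --  Number of continuation octets announced by a lead octet,
   --  or -1 if C cannot start a sequence.
   function Follow_Count (C : Character) return Integer is
     (if    Character'Pos (C) in 16#00# .. 16#7F# then 0
      elsif Character'Pos (C) in 16#C0# .. 16#DF# then 1
      elsif Character'Pos (C) in 16#E0# .. 16#EF# then 2
      elsif Character'Pos (C) in 16#F0# .. 16#F7# then 3
      else  -1);

   --  Run-time counterpart of the structural check sketched above.
   function Is_UTF_8 (S : String) return Boolean is
      I : Positive := S'First;
      N : Integer;
   begin
      while I <= S'Last loop
         N := Follow_Count (S (I));
         --  Reject a bad lead octet or a sequence cut off by the end.
         if N < 0 or else N > S'Last - I then
            return False;
         end if;
         for J in 1 .. N loop
            if not Is_Follow (S (I + J)) then
               return False;
            end if;
         end loop;
         I := I + N + 1;
      end loop;
      return True;
   end Is_UTF_8;

begin
   --  'ó' as two UTF-8 octets: well formed.
   Put_Line (Boolean'Image
     (Is_UTF_8 (Character'Val (16#C3#) & Character'Val (16#B3#))));
   --  'ó' as the single Latin-1 octet 16#F3#: a lead octet with
   --  no continuation octets after it.
   Put_Line (Boolean'Image (Is_UTF_8 ((1 => Character'Val (16#F3#)))));
end Check_UTF_8;
```

The first call yields TRUE and the second FALSE, which shows why a Latin-1 literal with upper-half characters will usually fail such a check while pure ASCII always passes.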
* Re: strange behaviour of utf-8 files 2013-11-22 3:02 ` Shark8 2013-11-22 11:54 ` Georg Bauhaus @ 2013-11-23 4:14 ` Randy Brukardt 2013-12-06 2:17 ` Georg Bauhaus 1 sibling, 1 reply; 33+ messages in thread From: Randy Brukardt @ 2013-11-23 4:14 UTC (permalink / raw) "Shark8" <onewingedshark@gmail.com> wrote in message news:672ce4f6-8c65-43b5-b04b-a7b858205af8@googlegroups.com... > On Thursday, November 21, 2013 6:03:29 PM UTC-7, Randy Brukardt wrote: >> >> P.S. Truth-in-advertising: Janus/Ada *only* takes Latin-1 input; it has >> no >> support for any other encoding (of course it supports Wide_String at >> runtime). That will have to change as we migrate to Ada 2012, but it >> probably will be a while before that happens (not a lot of demand). > > Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from > the customers]? Not a lot of demand for UTF-8 or wide characters in general. As far as Ada 2012 goes, if I want to use a feature, it somehow gets in the compiler. :-) Customer demand not required (but it always helps). Randy. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-23 4:14 ` Randy Brukardt @ 2013-12-06 2:17 ` Georg Bauhaus 0 siblings, 0 replies; 33+ messages in thread From: Georg Bauhaus @ 2013-12-06 2:17 UTC (permalink / raw)

On 23.11.13 05:14, Randy Brukardt wrote:
> "Shark8" <onewingedshark@gmail.com> wrote
>> Not a lot of demand for UTF-8, or not a lot of demand for Ada-2012 [from
>> the customers]?
>
> Not a lot of demand for UTF-8 or wide characters in general. As far as Ada
> 2012 goes, if I want to use a feature, it somehow gets in the compiler. :-)
> Customer demand not required (but it always helps).

Actually, programmers seem to suppress existing demand.
Equating "customers" to "consumers" of software for the moment
(who pays?), customers suffer from ASCII-fied communication
in ways that would not be accepted if written on paper.

I got a terribly malformed computer-generated message from
no less a company than DHL (inspiring this follow up). "??" in
the mail text quoted below has obviously been put in place
of what was perfectly UTF-8 encoded character data. (In the
mail's source text, to be sure.) The non-ASCII character is
'ü' (16#FC#) in both cases (L.8, L.10):

+======================================================================+
Subject: Ihre Sendung wurde in eine FILIALE umgeleitet
MIME-Version: 1.0
Content-Type: text/plain; charset=ANSI_X3.4-1968
Content-Transfer-Encoding: 7bit

Guten Tag Herr Georg Bauhaus,

leider konnte Ihre Sendung NICHT in die gew??nschte PACKSTATION eingestellt werden.

Die Sendung liegt f??r Sie in der FILIALE
(...)
+======================================================================+

Ironically, the messages are produced using an industry standard
Java framework while Java's char data are not 7bit ASCII:

Message-ID: <...48667.JavaMail.ypqbson@HANPQ021>

These messages used to be o.K. in the past. Judging by the count
of excess spaces and long and empty lines in the message, I guess
they are having some competitive programming shop streamline their
software.

Character set support can be a real issue when the use of ASCII
leads to misprints of addresses, or to ambiguity in legal documents.
Consider families Joseph Müller (16#FC#) and Joseph Möller (16#F6#)
each owning a flat in the same house. If rendered

   Fam. Joseph M??ller
   X Str. 15
   ...

and

   Fam. Joseph M??ller
   X Str. 15
   ...

respectively, what is the postman to do?

Proper support for encoding all characters is a necessity!

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-16 15:12 ` Stoik 2013-11-16 15:57 ` Dmitry A. Kazakov @ 2013-11-16 20:06 ` Peter C. Chapin 2013-11-17 10:34 ` Stoik 2013-11-22 0:53 ` Randy Brukardt 1 sibling, 2 replies; 33+ messages in thread From: Peter C. Chapin @ 2013-11-16 20:06 UTC (permalink / raw) On Sat, 16 Nov 2013, Stoik wrote: > By the way, nothing changes if I use wide_character and wide_string > instead of character and string. Even if character=octet, certainly > wide_character is not an octet! It sounds like you want something like function UTF8_String_To_Wide_String(S : String) return Wide_String; UTF-8 is a variable length encoding and thus not the same beast as Wide_String. String literals are going to be encoded in the same manner as the rest of the source text, of course. Peter ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-16 20:06 ` Peter C. Chapin @ 2013-11-17 10:34 ` Stoik 2013-11-22 0:53 ` Randy Brukardt 1 sibling, 0 replies; 33+ messages in thread From: Stoik @ 2013-11-17 10:34 UTC (permalink / raw)

On Saturday, 16 November 2013 21:06:28 UTC+1, Peter C. Chapin wrote:
> On Sat, 16 Nov 2013, Stoik wrote:
> > By the way, nothing changes if I use wide_character and wide_string
> > instead of character and string. Even if character=octet, certainly
> > wide_character is not an octet!
>
> It sounds like you want something like
>
> function UTF8_String_To_Wide_String(S : String) return Wide_String;
>
> UTF-8 is a variable length encoding and thus not the same beast as
> Wide_String. String literals are going to be encoded in the same manner as
> the rest of the source text, of course.
>
> Peter

Thank you. I always use the switches; this time I forgot to add them :(
This solves the problem, everything works fine.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: strange behaviour of utf-8 files 2013-11-16 20:06 ` Peter C. Chapin 2013-11-17 10:34 ` Stoik @ 2013-11-22 0:53 ` Randy Brukardt 1 sibling, 0 replies; 33+ messages in thread From: Randy Brukardt @ 2013-11-22 0:53 UTC (permalink / raw)

"Peter C. Chapin" <PChapin@vtc.vsc.edu> wrote in message
news:alpine.DEB.2.02.1311161503000.6074@whirlwind...
> On Sat, 16 Nov 2013, Stoik wrote:
>
>> By the way, nothing changes if I use wide_character and wide_string
>> instead of character and string. Even if character=octet, certainly
>> wide_character is not an octet!
>
> It sounds like you want something like
>
> function UTF8_String_To_Wide_String(S : String) return Wide_String;
>
> UTF-8 is a variable length encoding and thus not the same beast as
> Wide_String. String literals are going to be encoded in the same manner as
> the rest of the source text, of course.

Ada 2012 has Ada.Strings.UTF_Encoding for run-time encoding conversions. (See
A.4.11.) We might be able to do better in the next version of Ada (whenever
that is), but I wouldn't hold my breath.

Randy.

^ permalink raw reply [flat|nested] 33+ messages in thread
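The conversion asked for above (UTF-8 octets in a String to a Wide_String) is available through the child package Ada.Strings.UTF_Encoding.Wide_Strings. A minimal sketch, assuming an Ada 2012 compiler; the procedure name and constants are made up for illustration, and the input is built from the octet values quoted earlier in the thread so that the source stays pure ASCII.

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure Decode_Demo is
   use Ada.Strings.UTF_Encoding;

   --  'ó' as its two UTF-8 octets, 16#C3# 16#B3#.
   Encoded : constant UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#B3#);

   --  Decode turns the octet sequence into wide characters.
   Decoded : constant Wide_String := Wide_Strings.Decode (Encoded);
begin
   Put_Line ("Encoded length:" & Integer'Image (Encoded'Length));
   Put_Line ("Decoded length:" & Integer'Image (Decoded'Length));
   Put_Line ("Code point:    " &
             Integer'Image (Wide_Character'Pos (Decoded (Decoded'First))));
end Decode_Demo;
```

The encoded length is 2, the decoded length is 1, and the code point is 243, the same value as Character'Val (243) mentioned earlier. The matching Encode function goes the other way, and the sibling packages Strings and Wide_Wide_Strings cover the other two string types.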