* System.WCh_Cnv
From: Y.Tomino @ 2006-07-12 14:13 UTC

Hello.

In Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle JIS
character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other options
seem to mean Unicode.
JIS character codes are not compatible with Unicode.
Also, why does it not use C runtime locale functions like mbstowcs?
I'm getting confused.

--
YT

* Re: System.WCh_Cnv
From: Martin Krischik @ 2006-07-12 15:51 UTC

Y.Tomino wrote:

> In Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle JIS
> character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other options
> seem to mean Unicode.
> JIS character codes are not compatible with Unicode.
> Also, why does it not use C runtime locale functions like mbstowcs?
> I'm getting confused.

Well, it's a System.* package and most System packages are for internal
compiler use.

If you want proper Unicode support you should use XML/Ada. You can get a copy
from gnuada.sf.net.

Martin

--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com

* Re: System.WCh_Cnv
From: Björn Persson @ 2006-07-12 18:57 UTC

Martin Krischik wrote:
> If you want proper Unicode support you should use XML/Ada.

Does XML/Ada support any JIS encodings? I thought it had only some ISO 8859
and UTF encodings.

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu

* Re: System.WCh_Cnv
From: demoonlit @ 2006-07-13 17:24 UTC

Martin Krischik wrote:
> Well, it's a System.* package and most System packages are for internal
> compiler use.

Thank you, and yes, System.WCh_Cnv is used internally by Wide_Text_IO and
the compiler.
So text read with Wide_Text_IO will be in JIS character code, because my
Windows command prompt uses Shift-JIS.
So I'd convert JIS character code to another code set (with iconv,
mbstowcs, etc.).

Raw JIS character code values are not popular in Japan. Text data is often
encoded as Shift-JIS, EUC or UTF-8.
And other Ada packages assume Wide_String is Unicode. I think that's
natural, because C functions assume wchar_t is Unicode.
If at all possible, I want to treat Wide_String as Unicode, the same as the
C runtime functions.

System.WCh_Cnv confounds JIS character code with Unicode, and that causes
trouble. Wide_Text_IO (and -gnatWs, -gnatWe) are useless in fact,
because nothing uses JIS character code as it is; conversion
is needed after all.

So, I want to know why System.WCh_Cnv uses JIS character code?

--
YT

* Re: System.WCh_Cnv
From: Björn Persson @ 2006-07-13 21:30 UTC

demoonlit@panathenaia.halfmoon.jp wrote:
> Windows command prompt

OK, so your OS is Windows.

> So I'd convert JIS character code to another code set (with iconv,
> mbstowcs, etc.).

Do you have Iconv? I thought that wasn't available in Windows.

> And other Ada packages assume Wide_String is Unicode. I think that's
> natural, because C functions assume wchar_t is Unicode.
> If at all possible, I want to treat Wide_String as Unicode, the same as the
> C runtime functions.

Let's get some things straight so that we'll understand each other
better. Unicode defines several character encodings. When you write
"Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or
UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4.

In C it's implementation-defined how wide a wide character is. As far as
I know, wchar_t is usually 32 bits in Unix, so that a "wide string" is
UCS-4. I hear Microsoft uses 16 bits for wchar_t, but I'm not sure
whether a "wide string" in Windows is treated as UCS-2 or UTF-16.

> System.WCh_Cnv confounds JIS character code with Unicode, and that causes
> trouble. Wide_Text_IO (and -gnatWs, -gnatWe) are useless in fact,
> because nothing uses JIS character code as it is; conversion
> is needed after all.

I haven't used that package myself so I don't know how it works, but I
won't be surprised if it's buggy. In my experience, AdaCore's handling
of character encodings is rather unimpressive.

> So, I want to know why System.WCh_Cnv uses JIS character code?

As it's a GNAT-specific package, only AdaCore knows why they did it the
way they did. You could ask them, but I don't think they'll answer this
kind of question unless you buy a support contract.

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu

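The widths discussed above can be checked directly on a given compiler. The following is a minimal sketch, not part of the original thread; the procedure name Show_Widths is made up for illustration, and the numbers it prints depend on the compiler and platform, so no expected output is given.

```ada
with Ada.Text_IO;
with Interfaces.C;

--  Hypothetical helper: prints the size in bits that the compiler reports
--  for the three Ada character types and for the C wide character type.
procedure Show_Widths is
   use Ada.Text_IO;
begin
   Put_Line ("Character'Size            ="
             & Integer'Image (Character'Size));
   Put_Line ("Wide_Character'Size       ="
             & Integer'Image (Wide_Character'Size));
   Put_Line ("Wide_Wide_Character'Size  ="
             & Integer'Image (Wide_Wide_Character'Size));
   --  wchar_t is implementation-defined, which is the point made above.
   Put_Line ("Interfaces.C.wchar_t'Size ="
             & Integer'Image (Interfaces.C.wchar_t'Size));
end Show_Widths;
```

Comparing the last line between a Unix build and a Windows build is one way to see the 32-bit versus 16-bit difference mentioned in the message.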
* Re: System.WCh_Cnv
From: Dmitry A. Kazakov @ 2006-07-14 7:19 UTC

On Thu, 13 Jul 2006 21:30:15 GMT, Björn Persson wrote:

> I hear Microsoft uses 16 bits for wchar_t, but I'm not sure
> whether a "wide string" in Windows is treated as UCS-2 or UTF-16.

The latter, AFAIK.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: System.WCh_Cnv
From: Martin Krischik @ 2006-07-14 7:40 UTC

Björn Persson wrote:

> Let's get some things straight so that we'll understand each other
> better. Unicode defines several character encodings. When you write
> "Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or
> UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4.

I wonder about that. UCS character sets are fixed length and UTF
character sets are variable length. So is it right to say that UCS-4
is UTF-32?

Martin

* Re: System.WCh_Cnv
From: Björn Persson @ 2006-07-14 12:18 UTC

Martin Krischik wrote:
> I wonder about that. UCS character sets are fixed length and UTF
> character sets are variable length. So is it right to say that UCS-4
> is UTF-32?

I believe every possible text will be encoded identically in UCS-4BE and
UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
counter-example then I would like to see it. What character could take
up more than one code unit in UTF-32?

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu

* Re: System.WCh_Cnv
From: Martin Krischik @ 2006-07-16 11:41 UTC

Björn Persson wrote:

> Martin Krischik wrote:
>> I wonder about that. UCS character sets are fixed length and UTF
>> character sets are variable length. So is it right to say that UCS-4
>> is UTF-32?
>
> I believe every possible text will be encoded identically in UCS-4BE and
> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
> counter-example then I would like to see it. What character could take
> up more than one code unit in UTF-32?

A few years ago you could have said the same replacing all '32' with '16'.
Many programmers relied on UTF-16 and UCS-2 being the same. There were
no counter-examples at the time either. But one fine day in 2001 the
Unicode authority(s) defined the 65537th character...

I know that currently only 21 bits are actually used and the Unicode
authority(s) have given up on using more codepoints. Still I am unsure of
just declaring them both the same.

Martin

--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com

* Re: System.WCh_Cnv
From: Björn Persson @ 2006-07-24 21:00 UTC

Martin Krischik wrote:
> Björn Persson wrote:
>
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
>
> A few years ago you could have said the same replacing all '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> Unicode authority(s) defined the 65537th character...
>
> I know that currently only 21 bits are actually used and the Unicode
> authority(s) have given up on using more codepoints. Still I am unsure of
> just declaring them both the same.

How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.

I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says: "UTF-32 encoding form: The Unicode
encoding form which assigns each Unicode scalar value to a single
unsigned 32-bit code unit with the same numeric value as the Unicode
scalar value." Note "single".

Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".

Of course, should the Unicode consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu

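The fixed-width versus variable-width distinction quoted above can be observed by counting code units. This is a minimal sketch that assumes an Ada 2012 compiler, since it relies on Ada.Strings.UTF_Encoding.Wide_Wide_Strings, which was standardized after this thread was written. The procedure name Code_Units and the helper Report are illustrative only.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Code_Units is
   use Ada.Text_IO;
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  One character inside the Basic Multilingual Plane (U+20AC)
   --  and the first one outside it (U+10000).
   Euro    : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#20AC#));
   Outside : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1_0000#));

   --  Prints how many code units the text needs in each encoding form.
   procedure Report (Name : String; Item : Wide_Wide_String) is
      U8  : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
        Conv.Encode (Item);
      U16 : constant Ada.Strings.UTF_Encoding.UTF_16_Wide_String :=
        Conv.Encode (Item);
   begin
      Put_Line (Name
                & ": UTF-8 units =" & Integer'Image (U8'Length)
                & ", UTF-16 units =" & Integer'Image (U16'Length)
                & ", UTF-32 units =" & Integer'Image (Item'Length));
   end Report;
begin
   Report ("U+20AC ", Euro);
   Report ("U+10000", Outside);
end Code_Units;
```

U+20AC takes three UTF-8 octets and one UTF-16 unit, while U+10000 takes four UTF-8 octets and a surrogate pair of two UTF-16 units. In both cases the Wide_Wide_String holds exactly one element, which is the "single code unit" property of UTF-32.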
* Re: System.WCh_Cnv
From: Randy Brukardt @ 2006-07-24 23:35 UTC

"Björn Persson" <spam-away@nowhere.nil> wrote in message
news:Sxaxg.10370$E02.3445@newsb.telia.net...
>...
> How about a *hypothetical* counter-example? If you had a character with
> the code point 100000000 hexadecimal, how would you encode it in UTF-32?
> I believe it's impossible; I believe UTF-32 is a fixed-width encoding.

It would have to be hypothetical: Unicode is a 31-bit character set. Note
that I said 31 bits, not 32 bits. Thus, UTF-32 and UCS-4 are the same *if
encoding Unicode characters*. (UTF-32 would need extra bytes to encode
32-bit characters with the high bit on, but those would not be Unicode
characters.)

Of course, some future character set could use more than 31 bits, but that
seems well into the future.

                                Randy Brukardt.

* Re: System.WCh_Cnv
From: Marius Amado-Alves @ 2006-07-25 0:45 UTC
To: Randy Brukardt; +Cc: comp.lang.ada

>> How about a *hypothetical* counter-example? If you had a character
>> with
>> the code point 100000000 hexadecimal, how would you encode it in
>> UTF-32?
>
> It would have to be hypothetical: Unicode is a 31-bit character set.

Actually the Unicode codepoint range is 0 .. 10FFFF and therefore
fits in 21 bits.

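The 21-bit figure follows from counting the binary digits of 16#10FFFF#. A small sketch of that arithmetic; the names Bits_Needed and Code are illustrative and not taken from the thread.

```ada
with Ada.Text_IO;

procedure Bits_Needed is
   --  A type that covers exactly the Unicode code point range.
   type Code is range 0 .. 16#10FFFF#;

   Value : Code    := Code'Last;
   Bits  : Natural := 0;
begin
   --  Halve the largest code point until nothing is left, counting
   --  one binary digit per step.
   while Value > 0 loop
      Value := Value / 2;
      Bits  := Bits + 1;
   end loop;

   Ada.Text_IO.Put_Line ("Bits needed:" & Natural'Image (Bits));
   --  prints "Bits needed: 21"
end Bits_Needed;
```

Put differently, 2**20 - 1 = 16#FFFFF# is smaller than 16#10FFFF#, and 2**21 - 1 = 16#1FFFFF# is larger, so 21 bits are necessary and sufficient.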
* Re: System.WCh_Cnv
From: Georg Bauhaus @ 2006-07-14 16:13 UTC

Martin Krischik wrote:
> Björn Persson wrote:
>
>> Let's get some things straight so that we'll understand each other
>> better. Unicode defines several character encodings. When you write
>> "Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or
>> UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4.
>
> I wonder about that. UCS character sets are fixed length and UTF
> character sets are variable length. So is it right to say that UCS-4
> is UTF-32?

UTF is a UCS Transformation Format.
UCS is the Universal Multiple-Octet Coded Character Set.
UCS-4 is the canonical form of the UCS.

Not sure whether UCS has anything to say about endianness.
UTF has.

-- Georg

* Re: System.WCh_Cnv
From: Björn Persson @ 2006-07-12 18:57 UTC

Y.Tomino wrote:
> In Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle JIS
> character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other options
> seem to mean Unicode.

If you happen to be using some Unix-like operating system which is
Posix-conformant enough to have the Iconv library, then you might want
to use the EAstrings packages in AdaCL (http://adacl.sourceforge.net/).
EAstrings converts automatically between all the encodings your Iconv
library knows.

Otherwise, there is System.WCh_JIS, which is supposed to handle EUC and
Shift-JIS. Have you tried that?

(I still need to get EAstrings working in Windows. I could really use
some help from a Windows programmer.)

--
Björn Persson    PGP key A88682FD
omb jor ers @sv ge. r o.b n.p son eri nu

* Re: System.WCh_Cnv
From: demoonlit @ 2006-07-13 17:34 UTC

I don't want to convert EUC <=> Shift-JIS now; I want to use Wide_*
correctly.
A JIS character code representation in Wide_String is meaningless...

[parent not found: <EBEKJMEEPPFAACCBBGNHAELNDIAA.randy@rrsoftware.com>]
* Re: System.WCh_Cnv
From: Marius Amado-Alves @ 2006-07-25 10:31 UTC
To: comp.lang.ada

>> Actually the Unicode codepoint range is 0 .. 10FFFF and therefore
>> fits in 21 bits.
>
> ... the definition would allow expansion to 31-bits (but no
> further).

The definition of some particular *encoding* namely UCS-4. Not of the
"character set" range. Character = codepoint. And this stops at
10FFFF. And it will not be extended. IIRC both Organizations went on
record on this. Silly maybe, but not per se. It has to do with
variable length encodings. It facilitates search and verification.
Now these encodings may be a bit silly, yes.

I have been sketching a highly simplified, short, clear, logical,
understandable, usable, no nonsense, package for Unicode. I have not
been making much progress for several reasons. If someone wants to
join that would be great. The first lines of the spec follow.

-- Unico : no nonsense Unicode support for Ada
-- (C) 2006 Marius Amado Alves

with Ada.Containers.Vectors;
with Ada.Streams;

package Unico is

    type Character is range 0 .. 16#10FFFF#;
    for Character'Size use 24;

    procedure Write
      (Stream : access Ada.Streams.Root_Stream_Type'Class;
       Item : in Character);

    procedure Read
      (Stream : access Ada.Streams.Root_Stream_Type'Class;
       Item : out Character);

    for Character'Write use Write;
    for Character'Read use Read;

    package Strings is new Ada.Containers.Vectors
      (Index_Type => Positive, Element_Type => Character);

    subtype String is Strings.Vector;

    type Fixed_String is array (Positive range <>) of Character;
    for Fixed_String'Component_Size use 24;
    pragma Pack (Fixed_String);

* Re: System.WCh_Cnv
From: Dmitry A. Kazakov @ 2006-07-25 12:21 UTC

On Tue, 25 Jul 2006 11:31:08 +0100, Marius Amado-Alves wrote:

>>> Actually the Unicode codepoint range is 0 .. 10FFFF and therefore
>>> fits in 21 bits.
>>
>> ... the definition would allow expansion to 31-bits (but no
>> further).
>
> The definition of some particular *encoding* namely UCS-4. Not of the
> "character set" range. Character = codepoint. And this stops at
> 10FFFF. And it will not be extended. IIRC both Organizations went on
> record on this. Silly maybe, but not per se. It has to do with
> variable length encodings. It facilitates search and verification.
> Now these encodings may be a bit silly, yes.
>
> I have been sketching a highly simplified, short, clear, logical,
> understandable, usable, no nonsense, package for Unicode. I have not
> been making much progress for several reasons. If someone wants to
> join that would be great. The first lines of the spec follow.
>
> -- Unico : no nonsense Unicode support for Ada
> -- (C) 2006 Marius Amado Alves
>
> with Ada.Containers.Vectors;
> with Ada.Streams;
>
> package Unico is
>
>     type Character is range 0 .. 16#10FFFF#;
>     for Character'Size use 24;
>
>     procedure Write
>       (Stream : access Ada.Streams.Root_Stream_Type'Class;
>        Item : in Character);
>
>     procedure Read
>       (Stream : access Ada.Streams.Root_Stream_Type'Class;
>        Item : out Character);
[...]

But how can you read/write it ignoring encoding?

As for the Character = code point idea, I think it was wrong from the very
start, in the form of Wide_Character. The advantages of being able to
index each individual code point in a string are minor compared with the
mess it brings with it. They become almost invisible if one takes into
account that places where that might be needed, like text rendering, don't
work on a per code point basis anyway.

So I'm quite happy with UTF-8 and plain strings.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

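To make the read/write question concrete: a stream Write for a code point type has to commit to some octet sequence. The sketch below picks UTF-8 purely as an assumption; it is not the Unico implementation, and the names Code_Point and To_UTF_8 are invented for this illustration. It does not reject surrogate values (16#D800# .. 16#DFFF#), which a real encoder should.

```ada
with Ada.Text_IO;
with Ada.Integer_Text_IO;

procedure UTF8_Write_Sketch is

   type Code_Point is range 0 .. 16#10FFFF#;

   --  Encode one code point as a UTF-8 octet sequence (1 to 4 octets).
   function To_UTF_8 (Item : Code_Point) return String is
   begin
      if Item < 16#80# then
         return (1 => Character'Val (Item));
      elsif Item < 16#800# then
         return (Character'Val (16#C0# + Item / 2**6),
                 Character'Val (16#80# + Item mod 2**6));
      elsif Item < 16#1_0000# then
         return (Character'Val (16#E0# + Item / 2**12),
                 Character'Val (16#80# + (Item / 2**6) mod 2**6),
                 Character'Val (16#80# + Item mod 2**6));
      else
         return (Character'Val (16#F0# + Item / 2**18),
                 Character'Val (16#80# + (Item / 2**12) mod 2**6),
                 Character'Val (16#80# + (Item / 2**6) mod 2**6),
                 Character'Val (16#80# + Item mod 2**6));
      end if;
   end To_UTF_8;

   Euro : constant String := To_UTF_8 (16#20AC#);
begin
   --  Show each octet in base 16; for U+20AC these are E2, 82 and AC.
   for I in Euro'Range loop
      Ada.Integer_Text_IO.Put
        (Character'Pos (Euro (I)), Width => 8, Base => 16);
   end loop;
   Ada.Text_IO.New_Line;
end UTF8_Write_Sketch;
```

A Write procedure attached with `for ...'Write use` could emit these octets to the stream. Choosing UTF-16 or UTF-32 instead would also force a byte order decision, which is one reason the encoding cannot simply be ignored.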
* Re: System.WCh_Cnv
From: Marius Amado-Alves @ 2006-07-25 13:03 UTC
To: comp.lang.ada

> places where that might be needed, like text rendering, don't work
> on a per code point basis anyway....

Exactly. And that is wrong, and I want to fix it.

> So I'm quite happy with UTF-8 and plain strings.

I am more or less happy with this too [1], but I think we can do
better. With UTF-8 in strings the two abstractions (codepoints,
encodings) are too entangled for my taste. In rigour you cannot use
the standard string operations. I mean you can but must fiddle with
the encodings i.e. you are not searching for a codepoint but for a
particular encoding. Instead I want to be able to write things like

    for I in Str'Range loop
       if Str (I) = Euro_Sign then ...
    end loop;

I cannot do that with UTF-8 in strings. Note that Wide_Wide_String is
of little help here, because of the endianness issue. But it might be
a good idea to base Unico on Wide_Wide_String for closeness to the
standard.

[1] What makes me happy about UTF-8 is that it seems to have become a
de facto default, common denominator encoding.

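One way to get the indexing loop shown above is to decode the UTF-8 octets once and then work on code points. This is a hedged sketch that assumes an Ada 2012 compiler for Ada.Strings.UTF_Encoding.Wide_Wide_Strings (not available when this thread was written). The procedure name Find_Euro and the sample text are illustrative.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Find_Euro is
   use Ada.Strings.UTF_Encoding;

   Euro_Sign : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#20AC#);

   --  The text "5 <euro> ok" as UTF-8 octets; the Euro sign is the
   --  three octets E2 82 AC.
   Encoded : constant UTF_8_String :=
     "5 " & Character'Val (16#E2#) & Character'Val (16#82#)
          & Character'Val (16#AC#) & " ok";

   --  After decoding, one array element corresponds to one code point.
   Str : constant Wide_Wide_String := Wide_Wide_Strings.Decode (Encoded);
begin
   for I in Str'Range loop
      if Str (I) = Euro_Sign then
         Ada.Text_IO.Put_Line
           ("Euro sign found at element" & Integer'Image (I));
      end if;
   end loop;
end Find_Euro;
```

Inside the program the endianness concern does not arise, because Wide_Wide_Character values are compared as values. Byte order only matters at the point where the text is written out or read in as UTF-32 or UTF-16 octets.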
* Re: System.WCh_Cnv
From: Dmitry A. Kazakov @ 2006-07-25 13:36 UTC

On Tue, 25 Jul 2006 14:03:21 +0100, Marius Amado-Alves wrote:

>> So I'm quite happy with UTF-8 and plain strings.
>
> I am more or less happy with this too [1], but I think we can do
> better. With UTF-8 in strings the two abstractions (codepoints,
> encodings) are too entangled for my taste. In rigour you cannot use
> the standard string operations.

Yes, not all of them.

> I mean you can but must fiddle with
> the encodings i.e. you are not searching for a codepoint but for a
> particular encoding. Instead I want to be able to write things like
>
>     for I in Str'Range loop
>        if Str (I) = Euro_Sign then ...
>     end loop;
>
> I cannot do that with UTF-8 in strings.

I do it this way:

    declare
       Index : Integer := Str'First;
       Value : UTF8_Code_Point;
    begin
       while Index <= Str'Last loop
          Get (Str, Index, Value);
          if Value = Euro_Sign then ...
       end loop;

Actually if Ada had abstract array interfaces and inheritance we could have
it in exactly the form you wrote it. Alas.

Note that the pattern you refer to goes beyond just Unicode issues. Exactly
the same problem exists in pattern matching:

    while Index <= Str'Last loop
       if Match (Str, Index, Pattern) then ...
    end loop;

Basically it is a stream interface to strings with an ability to roll it
back or, equivalently, to look ahead.

> Note that Wide_Wide_String is
> of little help here, because of the endianness issue. But it might be
> a good idea to base Unico on Wide_Wide_String for closeness to the
> standard.

I prefer general solutions, like array interfaces. You have an opaque
object. Add an array interface to it, which would return code points or
Wide_x_100_Character or whatever you want. Here you are.

> [1] What makes me happy about UTF-8 is that it seems to have become a
> de facto default, common denominator encoding.

Long live Linux! (:-))

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

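The Get-based loop above relies on a procedure that decodes one code point and advances the index. The following is a self-contained sketch of what such a procedure could look like. It is an assumption for illustration, not the library code the message refers to; it expects well-formed UTF-8 and performs no validation.

```ada
with Ada.Text_IO;

procedure UTF8_Get_Sketch is

   type Code_Point is range 0 .. 16#10FFFF#;
   Euro_Sign : constant Code_Point := 16#20AC#;

   --  Decode the code point that starts at Source (Index) and move
   --  Index just past it. Assumes well-formed UTF-8 input.
   procedure Get
     (Source : String;
      Index  : in out Integer;
      Value  : out Code_Point)
   is
      Lead  : constant Natural := Character'Pos (Source (Index));
      Extra : Natural;  --  number of continuation octets
      Acc   : Natural;  --  bits collected so far
   begin
      if Lead < 16#80# then
         Extra := 0;  Acc := Lead;
      elsif Lead < 16#E0# then
         Extra := 1;  Acc := Lead mod 2**5;
      elsif Lead < 16#F0# then
         Extra := 2;  Acc := Lead mod 2**4;
      else
         Extra := 3;  Acc := Lead mod 2**3;
      end if;

      --  Each continuation octet contributes its low six bits.
      for K in 1 .. Extra loop
         Acc := Acc * 2**6 + Character'Pos (Source (Index + K)) mod 2**6;
      end loop;

      Index := Index + Extra + 1;
      Value := Code_Point (Acc);
   end Get;

   --  "5 <euro> ok" as UTF-8 octets.
   Str : constant String :=
     "5 " & Character'Val (16#E2#) & Character'Val (16#82#)
          & Character'Val (16#AC#) & " ok";

   Index : Integer := Str'First;
   Value : Code_Point;
   Count : Natural := 0;
begin
   while Index <= Str'Last loop
      Get (Str, Index, Value);
      Count := Count + 1;
      if Value = Euro_Sign then
         Ada.Text_IO.Put_Line
           ("Euro sign is code point number" & Natural'Image (Count));
      end if;
   end loop;
end UTF8_Get_Sketch;
```

With the sample text the Euro sign is the third code point although it starts at the third octet and spans octets 3 to 5, so the loop makes six iterations over eight octets. That is the "stream with look-ahead" behaviour described in the message.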
* Re: System.WCh_Cnv
From: Georg Bauhaus @ 2006-07-25 14:09 UTC

On Tue, 2006-07-25 at 14:03 +0100, Marius Amado-Alves wrote:

>     for I in Str'Range loop
>        if Str (I) = Euro_Sign then ...
>     end loop;

As you implied elsewhere

    while Has_Element(Str) loop
       if Element(Str) = Euro_Sign then ...
    end loop;

UTF-anything is an external form for transmission of characters.
So yes, perhaps internal Wide_Wide_Character will be well suited.
When I tried Containers.Vectors in place of Unbounded_String,
the container performance was very good.

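The cursor-style loop above can be written against the standard container API, where Has_Element and Element take a Cursor rather than the container itself. A minimal sketch, assuming Ada 2005 or later; the package name WW_Vectors and the procedure name Vector_Scan are made up for this example.

```ada
with Ada.Text_IO;
with Ada.Containers.Vectors;

procedure Vector_Scan is

   --  A growable string of code points held as Wide_Wide_Character.
   package WW_Vectors is new Ada.Containers.Vectors
     (Index_Type => Positive, Element_Type => Wide_Wide_Character);
   use WW_Vectors;

   Euro_Sign : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#20AC#);

   Str : Vector;
   Pos : Cursor;
begin
   Append (Str, '5');
   Append (Str, Euro_Sign);
   Append (Str, '!');

   --  Walk the vector with a cursor, one code point per step.
   Pos := First (Str);
   while Has_Element (Pos) loop
      if Element (Pos) = Euro_Sign then
         Ada.Text_IO.Put_Line
           ("Euro sign at index" & Positive'Image (To_Index (Pos)));
         --  prints "Euro sign at index 2"
      end if;
      Next (Pos);
   end loop;
end Vector_Scan;
```

Because every element is a whole code point, plain indexing with Element (Str, I) works as well. The cursor form is shown because it matches the Has_Element / Element shape used in the message.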