comp.lang.ada
* System.WCh_Cnv
@ 2006-07-12 14:13 Y.Tomino
  2006-07-12 15:51 ` System.WCh_Cnv Martin Krischik
  2006-07-12 18:57 ` System.WCh_Cnv Björn Persson
  0 siblings, 2 replies; 21+ messages in thread
From: Y.Tomino @ 2006-07-12 14:13 UTC (permalink / raw)


Hello.

In the Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle
JIS character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other
options seem to mean Unicode.
The JIS character code is not compatible with Unicode.
Also, why does it not use C runtime locale functions like mbstowcs?
I'm getting confused.

-- 
YT




* Re: System.WCh_Cnv
  2006-07-12 14:13 System.WCh_Cnv Y.Tomino
@ 2006-07-12 15:51 ` Martin Krischik
  2006-07-12 18:57   ` System.WCh_Cnv Björn Persson
  2006-07-13 17:24   ` System.WCh_Cnv demoonlit
  2006-07-12 18:57 ` System.WCh_Cnv Björn Persson
  1 sibling, 2 replies; 21+ messages in thread
From: Martin Krischik @ 2006-07-12 15:51 UTC (permalink / raw)


Y.Tomino wrote:

> In the Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle
> JIS character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other
> options seem to mean Unicode.
> The JIS character code is not compatible with Unicode.
> Also, why does it not use C runtime locale functions like mbstowcs?
> I'm getting confused.

Well it's a System.* package and most System packages are for internal
compiler use. If you want proper Unicode support you should use XML/Ada.

You can get a copy from gnuada.sf.net.

Martin
-- 
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com




* Re: System.WCh_Cnv
  2006-07-12 14:13 System.WCh_Cnv Y.Tomino
  2006-07-12 15:51 ` System.WCh_Cnv Martin Krischik
@ 2006-07-12 18:57 ` Björn Persson
  2006-07-13 17:34   ` System.WCh_Cnv demoonlit
  1 sibling, 1 reply; 21+ messages in thread
From: Björn Persson @ 2006-07-12 18:57 UTC (permalink / raw)


Y.Tomino wrote:
> In the Ada of gcc-4.1, why do the UTF-32 routines of System.WCh_Cnv handle
> JIS character codes when WCEM_Shift_JIS or WCEM_EUC is selected? The other
> options seem to mean Unicode.

If you happen to be using some Unix-like operating system which is 
Posix-conformant enough to have the Iconv library, then you might want 
to use the EAstrings packages in AdaCL (http://adacl.sourceforge.net/). 
EAstrings converts automatically between all the encodings your Iconv 
library knows.

Otherwise, there is System.WCh_JIS, which is supposed to handle EUC and 
Shift-JIS. Have you tried that?

(I still need to get EAstrings working in Windows. I could really use 
some help from a Windows programmer.)

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: System.WCh_Cnv
  2006-07-12 15:51 ` System.WCh_Cnv Martin Krischik
@ 2006-07-12 18:57   ` Björn Persson
  2006-07-13 17:24   ` System.WCh_Cnv demoonlit
  1 sibling, 0 replies; 21+ messages in thread
From: Björn Persson @ 2006-07-12 18:57 UTC (permalink / raw)


Martin Krischik wrote:

> If you want proper Unicode support you should use XML/Ada.

Does XML/Ada support any JIS encodings? I thought it had only some ISO 
8859 and UTF encodings.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: System.WCh_Cnv
  2006-07-12 15:51 ` System.WCh_Cnv Martin Krischik
  2006-07-12 18:57   ` System.WCh_Cnv Björn Persson
@ 2006-07-13 17:24   ` demoonlit
  2006-07-13 21:30     ` System.WCh_Cnv Björn Persson
  1 sibling, 1 reply; 21+ messages in thread
From: demoonlit @ 2006-07-13 17:24 UTC (permalink / raw)


Martin Krischik wrote:
> Well it's a System.* package and most System packages are for internal
> compiler use.

Thank you, and yes, System.WCh_Cnv is used internally by Wide_Text_IO and
the compiler.
So text read with Wide_Text_IO ends up as JIS character codes, because my
Windows command prompt uses Shift-JIS.
I would then have to convert the JIS character codes to another code set
(with iconv, mbstowcs, etc.). Raw JIS character code values are not popular
in Japan. Text data is usually encoded as Shift-JIS, EUC or UTF-8.
Other Ada packages assume that Wide_String is Unicode. I think that is
natural, because C functions assume that wchar_t is Unicode.
If at all possible, I want to treat Wide_String as Unicode, the same as the
C runtime functions do.
System.WCh_Cnv confuses JIS character codes with Unicode, and that causes
trouble. Wide_Text_IO (and -gnatWs, -gnatWe) are in fact useless, because
nothing uses JIS character codes as they are; a conversion is needed after
all.
So, I want to know: why does System.WCh_Cnv take JIS character codes?

-- 
YT





* Re: System.WCh_Cnv
  2006-07-12 18:57 ` System.WCh_Cnv Björn Persson
@ 2006-07-13 17:34   ` demoonlit
  0 siblings, 0 replies; 21+ messages in thread
From: demoonlit @ 2006-07-13 17:34 UTC (permalink / raw)


I don't want to convert between EUC and Shift JIS right now; I want to use
Wide_* correctly.
A JIS character code representation of Wide_String is meaningless...





* Re: System.WCh_Cnv
  2006-07-13 17:24   ` System.WCh_Cnv demoonlit
@ 2006-07-13 21:30     ` Björn Persson
  2006-07-14  7:19       ` System.WCh_Cnv Dmitry A. Kazakov
  2006-07-14  7:40       ` System.WCh_Cnv Martin Krischik
  0 siblings, 2 replies; 21+ messages in thread
From: Björn Persson @ 2006-07-13 21:30 UTC (permalink / raw)


demoonlit@panathenaia.halfmoon.jp wrote:
> Windows command prompt

OK, so your OS is Windows.

> I would then have to convert the JIS character codes to another code set
> (with iconv, mbstowcs, etc.).

Do you have Iconv? I thought that wasn't available in Windows.

> Other Ada packages assume that Wide_String is Unicode. I think that is
> natural, because C functions assume that wchar_t is Unicode.
> If at all possible, I want to treat Wide_String as Unicode, the same as the
> C runtime functions do.

Let's get some things straight so that we'll understand each other 
better. Unicode defines several character encodings. When you write 
"Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or 
UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4. In C 
it's implementation-defined how wide a wide character is. As far as I 
know, wchar_t is usually 32 bits in Unix, so that a "wide string" is 
UCS-4. I hear Microsoft uses 16 bits for wchar_t, but I'm not sure 
whether a "wide string" in Windows is treated as UCS-2 or UTF-16.

> System.WCh_Cnv confuses JIS character codes with Unicode, and that causes
> trouble. Wide_Text_IO (and -gnatWs, -gnatWe) are in fact useless, because
> nothing uses JIS character codes as they are; a conversion is needed after
> all.

I haven't used that package myself so I don't know how it works, but I
won't be surprised if it's buggy. In my experience, AdaCore's handling
of character encodings is rather unimpressive.

> So, I want to know: why does System.WCh_Cnv take JIS character codes?

As it's a GNAT-specific package, only AdaCore knows why they did it the
way they did. You could ask them, but I don't think they'll answer this
kind of question unless you buy a support contract.

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: System.WCh_Cnv
  2006-07-13 21:30     ` System.WCh_Cnv Björn Persson
@ 2006-07-14  7:19       ` Dmitry A. Kazakov
  2006-07-14  7:40       ` System.WCh_Cnv Martin Krischik
  1 sibling, 0 replies; 21+ messages in thread
From: Dmitry A. Kazakov @ 2006-07-14  7:19 UTC (permalink / raw)


On Thu, 13 Jul 2006 21:30:15 GMT, Björn Persson wrote:

> I hear Microsoft uses 16 bits for wchar_t, but I'm not sure 
> whether a "wide string" in Windows is treated as UCS-2 or UTF-16.

The latter, AFAIK.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: System.WCh_Cnv
  2006-07-13 21:30     ` System.WCh_Cnv Björn Persson
  2006-07-14  7:19       ` System.WCh_Cnv Dmitry A. Kazakov
@ 2006-07-14  7:40       ` Martin Krischik
  2006-07-14 12:18         ` System.WCh_Cnv Björn Persson
  2006-07-14 16:13         ` System.WCh_Cnv Georg Bauhaus
  1 sibling, 2 replies; 21+ messages in thread
From: Martin Krischik @ 2006-07-14  7:40 UTC (permalink / raw)



Björn Persson schrieb:

> Let's get some things straight so that we'll understand each other
> better. Unicode defines several character encodings. When you write
> "Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or
> UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4.

I wonder about that. UCS character sets are fixed length and UTF
character sets are variable length. So is it right to say that UCS-4
is UTF-32?

Martin





* Re: System.WCh_Cnv
  2006-07-14  7:40       ` System.WCh_Cnv Martin Krischik
@ 2006-07-14 12:18         ` Björn Persson
  2006-07-16 11:41           ` System.WCh_Cnv Martin Krischik
  2006-07-14 16:13         ` System.WCh_Cnv Georg Bauhaus
  1 sibling, 1 reply; 21+ messages in thread
From: Björn Persson @ 2006-07-14 12:18 UTC (permalink / raw)


Martin Krischik wrote:
> I wonder about that. UCS character sets are fixed length and UTF
> character sets are variable length. So is it right to say that UCS-4
> is UTF-32?

I believe every possible text will be encoded identically in UCS-4BE and 
UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a 
counter-example then I would like to see it. What character could take 
up more than one code unit in UTF-32?

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: System.WCh_Cnv
  2006-07-14  7:40       ` System.WCh_Cnv Martin Krischik
  2006-07-14 12:18         ` System.WCh_Cnv Björn Persson
@ 2006-07-14 16:13         ` Georg Bauhaus
  1 sibling, 0 replies; 21+ messages in thread
From: Georg Bauhaus @ 2006-07-14 16:13 UTC (permalink / raw)


Martin Krischik wrote:
> Björn Persson schrieb:
> 
>> Let's get some things straight so that we'll understand each other
>> better. Unicode defines several character encodings. When you write
>> "Unicode", do you mean UTF-8, UTF-16, UCS-4 (also called UTF-32) or
>> UCS-2? Ada's Wide_String is UCS-2 and Wide_Wide_String is UCS-4.
> 
> I wonder about that. UCS character sets are fixed length and UTF
> character sets are variable length. So is it right to say that UCS-4
> is UTF-32?

UTF is a UCS Transformation Format.
UCS is the Universal Multiple-Octet Coded Character Set.
UCS-4 is the canonical form of the UCS.

I'm not sure whether UCS has anything to say about endianness.
UTF has.


-- Georg 




* Re: System.WCh_Cnv
  2006-07-14 12:18         ` System.WCh_Cnv Björn Persson
@ 2006-07-16 11:41           ` Martin Krischik
  2006-07-24 21:00             ` System.WCh_Cnv Björn Persson
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Krischik @ 2006-07-16 11:41 UTC (permalink / raw)


Björn Persson wrote:

> Martin Krischik wrote:
>> I wonder about that. UCS character sets are fixed length and UTF
>> character sets are variable length. So is it right to say that UCS-4
>> is UTF-32?
> 
> I believe every possible text will be encoded identically in UCS-4BE and
> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
> counter-example then I would like to see it. What character could take
> up more than one code unit in UTF-32?

A few years ago you could have said the same, replacing every '32' with '16'.
Many programmers relied on UTF-16 and UCS-2 being the same. There were
no counter-examples at the time either. But one fine day in 2001 the
Unicode authorities defined the 65537th character...

I know that currently only 21 bits are actually used and that the Unicode
authorities have given up on using more code points. Still, I am unsure
about just declaring the two the same.

Martin
-- 
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com




* Re: System.WCh_Cnv
  2006-07-16 11:41           ` System.WCh_Cnv Martin Krischik
@ 2006-07-24 21:00             ` Björn Persson
  2006-07-24 23:35               ` System.WCh_Cnv Randy Brukardt
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Persson @ 2006-07-24 21:00 UTC (permalink / raw)


Martin Krischik wrote:
> Björn Persson wrote:
> 
>> Martin Krischik wrote:
>>> I wonder about that. UCS character sets are fixed length and UTF
>>> character sets are variable length. So is it right to say that UCS-4
>>> is UTF-32?
>> I believe every possible text will be encoded identically in UCS-4BE and
>> UTF-32BE, as well as in UCS-4LE and UTF-32LE. If you have a
>> counter-example then I would like to see it. What character could take
>> up more than one code unit in UTF-32?
> 
> A few years ago you could have said the same, replacing every '32' with '16'.
> Many programmers relied on UTF-16 and UCS-2 being the same. There were
> no counter-examples at the time either. But one fine day in 2001 the
> Unicode authorities defined the 65537th character...
> 
> I know that currently only 21 bits are actually used and that the Unicode
> authorities have given up on using more code points. Still, I am unsure
> about just declaring the two the same.

How about a *hypothetical* counter-example? If you had a character with
the code point 100000000 hexadecimal, how would you encode it in UTF-32?
I believe it's impossible; I believe UTF-32 is a fixed-width encoding.

I found what looks like the definition of UTF-32 in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, page 76 (actually
page 23 in the file). It says:

"UTF-32 encoding form: The Unicode encoding form which assigns each
Unicode scalar value to a single unsigned 32-bit code unit with the same
numeric value as the Unicode scalar value."

Note "single".

Also, in Unicode Technical Report #17, at
http://www.unicode.org/reports/tr17/, UTF-32 is listed under "Examples
of fixed-width encoding forms", while UTF-16 is listed under "Examples
of variable-width encoding forms".

Of course, should the Unicode consortium make the unwise decision to
change the definition of UTF-32, then it might no longer be equivalent
to UCS-4, but then it would no longer be UTF-32. It would be a different
encoding, and would deserve a different name.
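
To make the fixed-width versus variable-width distinction concrete, here is
a minimal sketch. It is not taken from the Unicode documents cited above; the
procedure name UTF_Widths and the sample code point U+1D11E are chosen only
for illustration. It encodes one scalar value outside the 16-bit range both
ways: UTF-32 yields a single code unit equal to the scalar value, while
UTF-16 needs two code units (a surrogate pair).

```ada
with Ada.Text_IO;
with Interfaces; use Interfaces;

procedure UTF_Widths is
   package U32_IO is new Ada.Text_IO.Modular_IO (Unsigned_32);

   --  Illustrative sample only: U+1D11E lies above 16#FFFF#.
   CP : constant Unsigned_32 := 16#1D11E#;

   --  UTF-32: the single code unit has the same numeric value
   --  as the scalar value.
   UTF_32_Unit : constant Unsigned_32 := CP;

   --  UTF-16: a scalar value above 16#FFFF# is split into a
   --  high and a low surrogate of 10 payload bits each.
   V    : constant Unsigned_32 := CP - 16#1_0000#;
   High : constant Unsigned_32 := 16#D800# + Shift_Right (V, 10);
   Low  : constant Unsigned_32 := 16#DC00# + (V and 16#3FF#);
begin
   Ada.Text_IO.Put ("UTF-32 code unit : ");
   U32_IO.Put (UTF_32_Unit, Width => 0, Base => 16);
   Ada.Text_IO.New_Line;

   Ada.Text_IO.Put ("UTF-16 code units: ");
   U32_IO.Put (High, Width => 0, Base => 16);
   Ada.Text_IO.Put (" ");
   U32_IO.Put (Low, Width => 0, Base => 16);
   Ada.Text_IO.New_Line;
end UTF_Widths;
```

The sketch does no range checking; it assumes CP is a valid scalar value
above 16#FFFF#.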

-- 
Björn Persson                              PGP key A88682FD
                    omb jor ers @sv ge.
                    r o.b n.p son eri nu




* Re: System.WCh_Cnv
  2006-07-24 21:00             ` System.WCh_Cnv Björn Persson
@ 2006-07-24 23:35               ` Randy Brukardt
  2006-07-25  0:45                 ` System.WCh_Cnv Marius Amado-Alves
  0 siblings, 1 reply; 21+ messages in thread
From: Randy Brukardt @ 2006-07-24 23:35 UTC (permalink / raw)


"Björn Persson" <spam-away@nowhere.nil> wrote in message
news:Sxaxg.10370$E02.3445@newsb.telia.net...
>...
> How about a *hypothetical* counter-example? If you had a character with
> the code point 100000000 hexadecimal, how would you encode it in UTF-32?
> I believe it's impossible; I believe UTF-32 is a fixed-width encoding.

It would have to be hypothetical: Unicode is a 31-bit character set. Note
that I said 31 bits, not 32-bits.

Thus, UTF-32 and UCS-4 are the same *if encoding Unicode characters*.
(UTF-32 would need extra bytes to encode 32-bit characters with the high bit
on, but those would not be Unicode characters.)

Of course, some future character set could use more than 31 bits, but that
seems well into the future.

                         Randy Brukardt.






* Re: System.WCh_Cnv
  2006-07-24 23:35               ` System.WCh_Cnv Randy Brukardt
@ 2006-07-25  0:45                 ` Marius Amado-Alves
  0 siblings, 0 replies; 21+ messages in thread
From: Marius Amado-Alves @ 2006-07-25  0:45 UTC (permalink / raw)
  To: Randy Brukardt; +Cc: comp.lang.ada

>> How about a *hypothetical* counter-example? If you had a character  
>> with
>> the code point 100000000 hexadecimal, how would you encode it in  
>> UTF-32?
>
> It would have to be hypothetical: Unicode is a 31-bit character set.

Actually the Unicode codepoint range is 0 .. 10FFFF and therefore  
fits in 21 bits.
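
The arithmetic can be checked with a small sketch (the names Code_Point_Bits,
Code_Point, N and Bits are illustrative only). It counts how many times the
largest code point can be halved before reaching zero, which is the number of
bits needed to represent it.

```ada
with Ada.Text_IO;

procedure Code_Point_Bits is
   --  The code point range stated above.
   type Code_Point is range 0 .. 16#10FFFF#;

   N    : Code_Point := Code_Point'Last;
   Bits : Natural    := 0;
begin
   --  Each halving strips one binary digit from N.
   while N > 0 loop
      N    := N / 2;
      Bits := Bits + 1;
   end loop;

   --  2**20 <= 16#10FFFF# < 2**21, so this reports 21.
   Ada.Text_IO.Put_Line ("Bits needed:" & Natural'Image (Bits));
end Code_Point_Bits;
```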





* Re: System.WCh_Cnv
       [not found] <8BB3B99E-16DA-4EBF-A2FE-50B079349CA9@amado-alves.info>
@ 2006-07-25  0:45 ` Marius Amado-Alves
  0 siblings, 0 replies; 21+ messages in thread
From: Marius Amado-Alves @ 2006-07-25  0:45 UTC (permalink / raw)
  To: comp.lang.ada

>> How about a *hypothetical* counter-example? If you had a character  
>> with
>> the code point 100000000 hexadecimal, how would you encode it in  
>> UTF-32?
>
> It would have to be hypothetical: Unicode is a 31-bit character set.

Actually the Unicode codepoint range is 0 .. 10FFFF and therefore  
fits in 21 bits.






* Re: System.WCh_Cnv
       [not found] <EBEKJMEEPPFAACCBBGNHAELNDIAA.randy@rrsoftware.com>
@ 2006-07-25 10:31 ` Marius Amado-Alves
  2006-07-25 12:21   ` System.WCh_Cnv Dmitry A. Kazakov
  0 siblings, 1 reply; 21+ messages in thread
From: Marius Amado-Alves @ 2006-07-25 10:31 UTC (permalink / raw)
  To: comp.lang.ada

>> Actually the Unicode codepoint range is 0 .. 10FFFF and therefore
>> fits in 21 bits.
>
> ... the definition would allow expansion to 31-bits (but no
> further).

The definition of some particular *encoding* namely UCS-4. Not of the  
"character set" range. Character = codepoint. And this stops at  
10FFFF. And it will not be extended. IIRC both Organizations went on  
record on this. Silly maybe, but not per se. It has to do with  
variable length encodings. It facilitates search and verification.  
Now these encodings may be a bit silly, yes.

I have been sketching a highly simplified, short, clear, logical,  
understandable, usable, no nonsense, package for Unicode. I have not  
been making much progress for several reasons. If someone wants to  
join that would be great. The first lines of the spec follow.

-- Unico : no nonsense Unicode support for Ada
-- (C) 2006 Marius Amado Alves

with Ada.Containers.Vectors;
with Ada.Streams;

package Unico is

    type Character is range 0 .. 16#10FFFF#;
    for Character'Size use 24;

    procedure Write
      (Stream : access Ada.Streams.Root_Stream_Type'Class;
       Item   : in Character);

    procedure Read
      (Stream : access Ada.Streams.Root_Stream_Type'Class;
       Item   : out Character);

    for Character'Write use Write;
    for Character'Read use Read;

    package Strings is new Ada.Containers.Vectors
      (Index_Type => Positive, Element_Type => Character);

    subtype String is Strings.Vector;

    type Fixed_String is array (Positive range <>) of Character;
    for Fixed_String'Component_Size use 24;
    pragma Pack (Fixed_String);






* Re: System.WCh_Cnv
  2006-07-25 10:31 ` System.WCh_Cnv Marius Amado-Alves
@ 2006-07-25 12:21   ` Dmitry A. Kazakov
  2006-07-25 13:03     ` System.WCh_Cnv Marius Amado-Alves
  0 siblings, 1 reply; 21+ messages in thread
From: Dmitry A. Kazakov @ 2006-07-25 12:21 UTC (permalink / raw)


On Tue, 25 Jul 2006 11:31:08 +0100, Marius Amado-Alves wrote:

>>> Actually the Unicode codepoint range is 0 .. 10FFFF and therefore
>>> fits in 21 bits.
>>
>> ... the definition would allow expansion to 31-bits (but no
>> further).
> 
> The definition of some particular *encoding* namely UCS-4. Not of the  
> "character set" range. Character = codepoint. And this stops at  
> 10FFFF. And it will not be extended. IIRC both Organizations went on  
> record on this. Silly maybe, but not per se. It has to do with  
> variable length encodings. It facilitates search and verification.  
> Now these encodings may be a bit silly, yes.
> 
> I have been sketching a highly simplified, short, clear, logical,  
> understandable, usable, no nonsense, package for Unicode. I have not  
> been making much progress for several reasons. If someone wants to  
> join that would be great. The first lines of the spec follow.
> 
> -- Unico : no nonsense Unicode support for Ada
> -- (C) 2006 Marius Amado Alves
> 
> with Ada.Containers.Vectors;
> with Ada.Streams;
> 
> package Unico is
> 
>     type Character is range 0 .. 16#10FFFF#;
>     for Character'Size use 24;
> 
>     procedure Write
>       (Stream : access Ada.Streams.Root_Stream_Type'Class;
>        Item   : in Character);
> 
>     procedure Read
>       (Stream : access Ada.Streams.Root_Stream_Type'Class;
>        Item   : out Character);
[...] 

But how can you read/write it ignoring encoding?

As for the Character = code point idea, I think it was wrong from the very
start, in the form of Wide_Character. The advantages of being able to index
each individual code point in a string are minor compared with the mess it
brings with it. They become almost invisible if one takes into account that
the places where that might be needed, like text rendering, don't work on a
per code point basis anyway. So I'm quite happy with UTF-8 and plain strings.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: System.WCh_Cnv
  2006-07-25 12:21   ` System.WCh_Cnv Dmitry A. Kazakov
@ 2006-07-25 13:03     ` Marius Amado-Alves
  2006-07-25 13:36       ` System.WCh_Cnv Dmitry A. Kazakov
  2006-07-25 14:09       ` System.WCh_Cnv Georg Bauhaus
  0 siblings, 2 replies; 21+ messages in thread
From: Marius Amado-Alves @ 2006-07-25 13:03 UTC (permalink / raw)
  To: comp.lang.ada

> places where that might be needed, like text rendering, don't work  
> on per
> code point basis anyway....

Exactly. And that is wrong, and I want to fix it.

> So I'm quite happy with UTF-8 and plain strings.

I am more or less happy with this too [1], but I think we can do  
better. With UTF-8 in strings the two abstractions (codepoints,  
encodings) are too entangled for my taste. In rigour you cannot use  
the standard string operations. I mean you can but must fiddle with  
the encodings i.e. you are not searching for a codepoint but for a  
particular encoding. Instead I want to be able to write things like

for I in Str'Range loop
    if Str (I) = Euro_Sign then ...
end loop;

I cannot do that with UTF-8 in strings. Note that Wide_Wide_String is
of little help here, because of the endianness issue. But it might be
a good idea to base Unico on Wide_Wide_String for closeness to the
standard.

[1] What makes me happy about UTF-8 is that it seems to have become a  
de facto default, common denominator encoding.
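
A minimal sketch of the loop above, under the assumption that the text is
held in memory as a Wide_Wide_String (Ada 2005), so that each element is one
code point. Count_Euro, Str and the sample text are illustrative and not part
of Unico. The sketch does no I/O, so byte order does not come into it here.

```ada
with Ada.Text_IO;

procedure Count_Euro is
   --  U+20AC EURO SIGN, built from its code point.
   Euro_Sign : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#20AC#);

   --  Illustrative sample text containing two euro signs.
   Str : constant Wide_Wide_String :=
     "10" & Euro_Sign & " or 20" & Euro_Sign;

   Count : Natural := 0;
begin
   --  Each element is a whole code point, so plain indexing works.
   for I in Str'Range loop
      if Str (I) = Euro_Sign then
         Count := Count + 1;
      end if;
   end loop;

   Ada.Text_IO.Put_Line ("Euro signs:" & Natural'Image (Count));
end Count_Euro;
```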





* Re: System.WCh_Cnv
  2006-07-25 13:03     ` System.WCh_Cnv Marius Amado-Alves
@ 2006-07-25 13:36       ` Dmitry A. Kazakov
  2006-07-25 14:09       ` System.WCh_Cnv Georg Bauhaus
  1 sibling, 0 replies; 21+ messages in thread
From: Dmitry A. Kazakov @ 2006-07-25 13:36 UTC (permalink / raw)


On Tue, 25 Jul 2006 14:03:21 +0100, Marius Amado-Alves wrote:

>> So I'm quite happy with UTF-8 and plain strings.
> 
> I am more or less happy with this too [1], but I think we can do  
> better. With UTF-8 in strings the two abstractions (codepoints,  
> encodings) are too entangled for my taste. In rigour you cannot use  
> the standard string operations.

Yes, not all of them.

> I mean you can but must fiddle with  
> the encodings i.e. you are not searching for a codepoint but for a  
> particular encoding. Instead I want to be able to write things like
> 
> for I in Str'Range loop
>     if Str (I) = Euro_Sign then ...
> end loop;
>
> I cannot do that with UTF-8 in strings.

I do it this way:

declare
   Index : Integer := Str'First;
   Value : UTF8_Code_Point;
begin
   while Index <= Str'Last loop
      Get (Str, Index, Value);   --  decodes one code point, advances Index
      if Value = Euro_Sign then ...
   end loop;
end;

Actually if Ada had abstract array interfaces and inheritance we could have
it in exactly the form you wrote it. Alas.
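
For readers who want to try this, here is a self-contained sketch of what
such a Get could look like. It is an assumption for illustration, not the
actual implementation behind the snippet above: the type UTF8_Code_Point and
the procedure Get are declared locally, and the decoder assumes well-formed
UTF-8 and does no validation.

```ada
with Ada.Text_IO;

procedure Find_Euro is
   --  Hypothetical local declarations, named after the snippet above.
   type UTF8_Code_Point is range 0 .. 16#10FFFF#;
   Euro_Sign : constant UTF8_Code_Point := 16#20AC#;

   --  Decode the UTF-8 sequence starting at Str (Index) and move
   --  Index to the first octet after it. Input must be well formed.
   procedure Get
     (Str   : in     String;
      Index : in out Positive;
      Value :    out UTF8_Code_Point)
   is
      Lead  : constant Natural := Character'Pos (Str (Index));
      Extra : Natural;  --  number of continuation octets
   begin
      if Lead < 16#80# then
         Value := UTF8_Code_Point (Lead);
         Extra := 0;
      elsif Lead < 16#E0# then
         Value := UTF8_Code_Point (Lead mod 16#20#);
         Extra := 1;
      elsif Lead < 16#F0# then
         Value := UTF8_Code_Point (Lead mod 16#10#);
         Extra := 2;
      else
         Value := UTF8_Code_Point (Lead mod 16#08#);
         Extra := 3;
      end if;

      --  Each continuation octet contributes its low six bits.
      for K in 1 .. Extra loop
         Value := Value * 64
           + UTF8_Code_Point (Character'Pos (Str (Index + K)) mod 64);
      end loop;

      Index := Index + Extra + 1;
   end Get;

   --  "a", EURO SIGN (octets E2 82 AC), "b" as raw UTF-8.
   Str : constant String :=
     'a' & Character'Val (16#E2#) & Character'Val (16#82#)
         & Character'Val (16#AC#) & 'b';

   Index : Positive := Str'First;
   Value : UTF8_Code_Point;
begin
   while Index <= Str'Last loop
      Get (Str, Index, Value);
      if Value = Euro_Sign then
         Ada.Text_IO.Put_Line ("Euro sign found");
      end if;
   end loop;
end Find_Euro;
```

A production decoder would also have to reject overlong forms, stray
continuation octets and truncated sequences; the sketch leaves that out.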

Note that the pattern you refer to goes beyond just Unicode issues. Exactly
the same problem exists in pattern matching:

while Index <= Str'Last loop
    if Match (Str, Index, Pattern) then ...
end loop;

Basically it is a stream interface to strings with an ability to roll it
back or, equivalently, to look ahead.

> Note that Wide_Wide_String is
> of little help here, because of the endianness issue. But it might be
> a good idea to base Unico on Wide_Wide_String for closeness to the
> standard.

I prefer general solutions, like array interfaces. You have an opaque
object. Add an array interface to it, which would return code points or
Wide_x_100_Character or whatever you want. Here you are.

> [1] What makes me happy about UTF-8 is that it seems to have become a  
> de facto default, common denominator encoding.

Long live Linux! (:-))

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: System.WCh_Cnv
  2006-07-25 13:03     ` System.WCh_Cnv Marius Amado-Alves
  2006-07-25 13:36       ` System.WCh_Cnv Dmitry A. Kazakov
@ 2006-07-25 14:09       ` Georg Bauhaus
  1 sibling, 0 replies; 21+ messages in thread
From: Georg Bauhaus @ 2006-07-25 14:09 UTC (permalink / raw)


On Tue, 2006-07-25 at 14:03 +0100, Marius Amado-Alves wrote:

> for I in Str'Range loop
>     if Str (I) = Euro_Sign then ...
> end loop;

As you implied elsewhere

  while Has_Element(Str) loop
      if Element(Str) = Euro_Sign then ...
  end loop;

UTF-anything is an external form for transmission of characters.
So yes, perhaps internal Wide_Wide_Character will be well suited.
When I tried Containers.Vectors in place of Unbounded_String,
the container performance was very good.





