string and wide string usage

comp.lang.ada

* string and wide string usage
From: Ali Bendriss @ 2013-03-07 11:12 UTC


Hello,

I've got a small program that reads some values from an LDAP server and
copies them into a PostgreSQL database.
The function reading the LDAP values returns an Unbounded_String, then I
use To_String to feed PostgreSQL (using GNATColl).

I've got a problem with some strings, for example:
a base64-encoded string
V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
Everything works if I feed the database directly, but if I
apply Ada.Characters.Handling.To_Lower to the string before feeding the
database, PostgreSQL is not happy:
'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
It's not really a big deal, but I would like to understand where the
problem is. Do I have to use Wide_String?

thanks,

Ali




* Re: string and wide string usage
From: ytomino @ 2013-03-07 14:20 UTC


On Thursday, March 7, 2013 8:12:01 PM UTC+9, Ali Bendriss wrote:
> I've got a problem with some strings, for example:
> a base64-encoded string
> V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
> which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
> Everything works if I feed the database directly, but if I
> apply Ada.Characters.Handling.To_Lower to the string before feeding the
> database, PostgreSQL is not happy:
> 'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
> It's not really a big deal, but I would like to understand where the
> problem is. Do I have to use Wide_String?

Because the functions in Ada.Characters.Handling take Latin-1, not UTF-8.
You have to either
1. convert the UTF-8 String to Wide_Wide_String, process it as UTF-32, and convert it back to UTF-8
  (Ada.Characters.Conversions also takes Latin-1, so use GNAT.Encode_String/Decode_String or Ada.Strings.UTF_Encoding for the conversion), or
2. find an external library that processes UTF-8 directly.
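
To make the cause concrete, here is a minimal sketch (the procedure name is made up for illustration) that reproduces the byte sequence from the PostgreSQL error. It assumes only the standard Latin-1 behaviour of Ada.Characters.Handling.To_Lower: the UTF-8 lead byte 16#C2# of the no-break space is read as the Latin-1 capital letter A-circumflex and mapped to 16#E2#, which starts a three-byte UTF-8 sequence that the following bytes do not complete.

```ada
with Ada.Characters.Handling;
with Ada.Integer_Text_IO;
with Ada.Text_IO;

procedure Latin1_Lower_Demo is
   --  UTF-8 bytes for "s", U+00A0 (no-break space), "7":
   --  16#73#, then 16#C2# 16#A0#, then 16#37#
   Input : constant String :=
     's' & Character'Val (16#C2#) & Character'Val (16#A0#) & '7';

   --  To_Lower works on Latin-1 characters, one byte at a time.
   --  It knows nothing about UTF-8 multi-byte sequences.
   Output : constant String := Ada.Characters.Handling.To_Lower (Input);
begin
   --  Print each resulting byte in base 16.
   for C of Output loop
      Ada.Integer_Text_IO.Put (Character'Pos (C), Width => 0, Base => 16);
      Ada.Text_IO.Put (' ');
   end loop;
   Ada.Text_IO.New_Line;
   --  Prints: 16#73# 16#E2# 16#A0# 16#37#
end Latin1_Lower_Demo;
```

The last three bytes are exactly the 0xe2 0xa0 0x37 that PostgreSQL rejects.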




* Re: string and wide string usage
From: Dmitry A. Kazakov @ 2013-03-07 17:14 UTC


On Thu, 7 Mar 2013 06:20:05 -0800 (PST), ytomino wrote:

> On Thursday, March 7, 2013 8:12:01 PM UTC+9, Ali Bendriss wrote:
>> I've got a problem with some strings, for example:
>> a base64-encoded string
>> V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
>> which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
>> Everything works if I feed the database directly, but if I
>> apply Ada.Characters.Handling.To_Lower to the string before feeding the
>> database, PostgreSQL is not happy:
>> 'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
>> It's not really a big deal, but I would like to understand where the
>> problem is. Do I have to use Wide_String?
>
> Because the functions in Ada.Characters.Handling take Latin-1, not UTF-8.
> You have to either
> 1. convert the UTF-8 String to Wide_Wide_String, process it as UTF-32, and convert it back to UTF-8
>   (Ada.Characters.Conversions also takes Latin-1, so use GNAT.Encode_String/Decode_String or Ada.Strings.UTF_Encoding for the conversion), or
> 2. find an external library that processes UTF-8 directly.

Provided the base64 data encodes a UTF-8 string that you want to convert
to a lower-case UTF-8 string using the Unicode lower-case mapping, you
can use

   function To_Lowercase (Value : String) return String;

from

http://www.dmitry-kazakov.de/ada/strings_edit.htm#7.6

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de




* Re: string and wide string usage
From: Vadim Godunko @ 2013-03-07 17:48 UTC


You can use Matreshka to decode the Base64 data into its Universal_String, convert it to lower case, and store it in a PostgreSQL database.

http://forge.ada-ru.org/matreshka/wiki




* Re: string and wide string usage
From: Randy Brukardt @ 2013-03-07 23:53 UTC

"ytomino" <aghia05@gmail.com> wrote in message 
news:5e5e7e80-7d69-47e1-9550-19e2e0a211a9@googlegroups.com...
> On Thursday, March 7, 2013 8:12:01 PM UTC+9, Ali Bendriss wrote:
>> I've got a problem with some strings, for example:
>> a base64-encoded string
>> V2luZG93c8KgNyBQcm9mZXNzaW9ubmVsIE4=
>> which decodes to 'Windows\xa07 Professionnel N' in UTF-8.
>> Everything works if I feed the database directly, but if I
>> apply Ada.Characters.Handling.To_Lower to the string before feeding the
>> database, PostgreSQL is not happy:
>> 'ERROR:  invalid byte sequence for encoding "UTF8": 0xe2 0xa0 0x37'
>> It's not really a big deal, but I would like to understand where the
>> problem is. Do I have to use Wide_String?
>
> Because the functions in Ada.Characters.Handling take Latin-1, not UTF-8.

Right. The proper thing to do (for Ada 2012) is to use
Ada.Wide_Characters.Handling (or Ada.Wide_Wide_Characters.Handling) to do
the case conversion, after converting the UTF-8 into a Wide_String (or
Wide_Wide_String).
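
A minimal sketch of that approach, assuming an Ada 2012 compiler; the names Lower_UTF_8_Demo and To_Lower_UTF_8 are made up for illustration. It uses Ada.Strings.UTF_Encoding.Wide_Wide_Strings for the conversion in both directions and Ada.Wide_Wide_Characters.Handling (the Ada 2012 spelling of the package) for the case mapping.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;

procedure Lower_UTF_8_Demo is
   package Codec renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Hypothetical helper: lower-case a UTF-8 encoded String.
   --  Codec.Decode raises Ada.Strings.UTF_Encoding.Encoding_Error
   --  if Item is not valid UTF-8.
   function To_Lower_UTF_8 (Item : String) return String is
      Decoded : constant Wide_Wide_String := Codec.Decode (Item);
   begin
      return Codec.Encode
        (Ada.Wide_Wide_Characters.Handling.To_Lower (Decoded));
   end To_Lower_UTF_8;

   --  "Windows", then U+00A0 encoded as 16#C2# 16#A0#, then "7"
   Input : constant String :=
     "Windows" & Character'Val (16#C2#) & Character'Val (16#A0#) & '7';
begin
   --  The no-break space has no lower-case form, so its two bytes
   --  should pass through unchanged; only 'W' should change.
   Ada.Text_IO.Put_Line (To_Lower_UTF_8 (Input));
end Lower_UTF_8_Demo;
```

Because the case mapping is applied to whole code points rather than to individual bytes, the result is valid UTF-8 again and can be passed on to the database as an ordinary String.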

If you're trying to do this in an older version of Ada, you'll have to find 
some library somewhere to do the job.

But I want to caution you that "converting to lower case" is not a great
idea if you plan to support arbitrary Unicode strings. Such conversions are
somewhat ambiguous, and tend to make strings that are different appear
similar (and sometimes the reverse happens as well). Usually, the best
plan is to store the strings unmodified and use Equal_Case_Insensitive to
compare them (this uses the most accurate comparison defined by Unicode, and
has the advantage of being guaranteed not to change in future character set
standards, which is NOT true of conversion to lower case).
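
As a rough illustration of "store unmodified, compare case-insensitively", the sketch below uses the String (Latin-1) form, Ada.Strings.Equal_Case_Insensitive, on plain ASCII data. The procedure name and the sample values are made up. The Ada 2012 manual also names Wide_ and Wide_Wide_ counterparts for Unicode text; whether a given compiler ships them is something to check rather than assume.

```ada
with Ada.Strings.Equal_Case_Insensitive;
with Ada.Text_IO;

procedure Compare_Demo is
   --  The stored value keeps its original case; nothing is converted.
   Stored : constant String := "Windows 7 Professionnel N";
   Query  : constant String := "windows 7 professionnel n";
begin
   if Ada.Strings.Equal_Case_Insensitive (Stored, Query) then
      Ada.Text_IO.Put_Line ("match");
   else
      Ada.Text_IO.Put_Line ("no match");
   end if;
end Compare_Demo;
```

Note that the String form still interprets each byte as Latin-1, so for UTF-8 data the strings would first have to be decoded and compared with a wide variant.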

There is a nice example of this problem in the next chapter of the Ada 2012
Rationale (although you'll have to wait until May to see it, unless you get
the Ada User Journal).

I realize you may have no choice, given that the design of your database
might not be in your control, and it might not matter if you don't plan to
have Greek and Turkish characters in your data (to mention two of the most
common cases where converting to lower case and Equal_Case_Insensitive give
different answers for Wide_Strings).

                                     Randy.


* Re: string and wide string usage
From: Yannick Duchêne (Hibou57) @ 2013-03-08  2:05 UTC


On Fri, 08 Mar 2013 00:53:25 +0100, Randy Brukardt <randy@rrsoftware.com>
wrote:
> But I want to caution you that "converting to lower case" is not a great
> idea if you plan to support arbitrary Unicode strings. Such conversions are
> somewhat ambiguous, and tend to make strings appear similar that are
> different (and sometimes the reverse happens as well).

If I'm not wrong, it's the reverse: conversion to upper case is the one
where you may lose the most.


-- 
“Syntactic sugar causes cancer of the semi-colons.” [1]
“Structured Programming supports the law of the excluded muddle.” [1]
[1]: Epigrams on Programming — Alan J. — P. Yale University




* Re: string and wide string usage
From: Randy Brukardt @ 2013-03-08  3:07 UTC

"Yannick Duchêne (Hibou57)" <yannick_duchene@yahoo.fr> wrote in message
news:op.wtlur3cnule2fv@cardamome...
> On Fri, 08 Mar 2013 00:53:25 +0100, Randy Brukardt <randy@rrsoftware.com>
> wrote:
>> But I want to caution you that "converting to lower case" is not a great
>> idea if you plan to support arbitrary Unicode strings. Such conversions are
>> somewhat ambiguous, and tend to make strings appear similar that are
>> different (and sometimes the reverse happens as well).
>
> If I'm not wrong, it's the reverse: conversion to upper case is the one
> where you may lose the most.

You're right that converting to upper case is worse, but that was my point:
don't convert to *anything*. It doesn't matter what you convert to, you lose
information and get the wrong answer in some cases (Turkish I's, for
instance). Rather, leave the text in its original case and use
Equal_Case_Insensitive to decide whether it matches something existing.
That's the rule for Ada identifiers (don't know if GNAT actually follows
that, though).

Admittedly, you can't do that with some databases, so that might not be an 
option for the OP - which is a reason not to use a database unless you 
really need transactions.

                                                              Randy.




