* Re: windows-1251 to utf-8
@ 2018-10-31 20:58 5% ` Randy Brukardt
From: Randy Brukardt @ 2018-10-31 20:58 UTC
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier. For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client; use AWS.Client;
>> with AWS.Messages; use AWS.Messages;
>> with AWS.Response; use AWS.Response;
>>
>> procedure Main is
>>
>> HTML_Result : Unbounded_String;
>> Request_Header_List : Header_List;
>>
>> begin
>>
>> Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 "
>> & "(X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>>
>> HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers
>> => Request_Header_List));
>>
>> Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My Linux terminal (default UTF-8) shows:
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If I set the terminal encoding to Windows-1251, all is well:
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard
> requires Character be Latin-1. The only case when your program would be
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the
> charset key into a string containing UTF-8 octets and use
> Streams.Stream_IO to write these octets as-is. The conversion for the case
> of windows-1251 I described earlier. Create a table Character'Pos
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
> underlying OS.
>
> P.P.S. Technically AWS also ignores Ada standard. But that is an
> established practice. Since there is no better way.
Right. Probably the easiest way to do this (using just Ada functions) would
be to:
(A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String
of the unbounded string to a Wide_String, and then store that in an
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
(B) Use Ada.Strings.Wide_Maps to create a character conversion map (the
conversions were described by another reply);
(C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B)
to your Unbounded_Wide_String;
(D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably
storing the result into an Unbounded_String.
You could potentially skip (D) if Wide_Text_IO works when sent to
Standard_Output (I'd expect that on Windows, no idea on Linux). In that
case, use Wide_Text_IO.Put to send your result.
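Steps (A) to (D) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a complete recoder: the mapping covers just the Cyrillic letters at 16#C0# .. 16#FF# plus the two Yo letters at 16#A8# and 16#B8#, so the remaining windows-1251 positions in 16#80# .. 16#BF# would still have to be added to the table. The names Recode_1251, Raw and Build_Map are made up for the example, and Raw stands in for the To_String of HTML_Result.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Unbounded;
with Ada.Text_IO.Text_Streams;

procedure Recode_1251 is
   use Ada.Strings.Wide_Unbounded;

   --  Hypothetical sample input: six windows-1251 octets (a short Russian
   --  word), held in a String the way the page body would arrive.
   Raw : constant String :=
     Character'Val (16#CF#) & Character'Val (16#F0#) & Character'Val (16#E8#)
     & Character'Val (16#E2#) & Character'Val (16#E5#) & Character'Val (16#F2#);

   --  (B) Partial windows-1251 -> Unicode map (assumption: letters only).
   function Build_Map return Ada.Strings.Wide_Maps.Wide_Character_Mapping is
      From, To : Wide_String (1 .. 66);
   begin
      --  16#C0# .. 16#FF# map onto U+0410 .. U+044F.
      for I in 0 .. 63 loop
         From (I + 1) := Wide_Character'Val (16#C0# + I);
         To (I + 1)   := Wide_Character'Val (16#0410# + I);
      end loop;
      --  Capital and small Yo sit outside the contiguous block.
      From (65) := Wide_Character'Val (16#A8#);
      To (65)   := Wide_Character'Val (16#0401#);
      From (66) := Wide_Character'Val (16#B8#);
      To (66)   := Wide_Character'Val (16#0451#);
      return Ada.Strings.Wide_Maps.To_Mapping (From, To);
   end Build_Map;

   Text : Unbounded_Wide_String;
begin
   --  (A) Widen each octet to a Wide_Character with the same position.
   Text := To_Unbounded_Wide_String
     (Ada.Characters.Conversions.To_Wide_String (Raw));

   --  (C) Apply the map; characters not in the map are left unchanged.
   Translate (Text, Build_Map);

   declare
      --  (D) Encode the translated text as UTF-8 octets.
      UTF8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
        Ada.Strings.UTF_Encoding.Wide_Strings.Encode (To_Wide_String (Text));
   begin
      --  Write the octets as-is through the stream, bypassing any
      --  character interpretation by Text_IO.
      String'Write
        (Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Current_Output), UTF8);
      Ada.Text_IO.New_Line;
   end;
end Recode_1251;
```

Writing through the stream rather than Put_Line follows the suggestion quoted above to emit the UTF-8 octets unmodified.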
In any case, this shows why Unicode exists, and why anything these days that
uses non-standard encodings is evil. There's really no short-cut to recoding
such things, and that makes them maddening.
Randy.
* Re: unicode and wide_text_io
@ 2017-12-28 22:35 7% ` G.B.
From: G.B. @ 2017-12-28 22:35 UTC
On 28.12.17 16:47, 00120260b@gmail.com wrote:
> Then, how come the norm hasn't made it a bit easier to input/output post-Latin-1 characters? Why aren't other norms/character sets/encodings more like special cases?
>
Actually, output of non-7-bit, unambiguously encoded text
has been made reasonably easy, I'd say, also defaulting
to what should be expected:
   with Ada.Wide_Text_IO.Text_Streams;
   with Ada.Strings.UTF_Encoding.Wide_Strings;

   procedure UTF is
      --  USD/EUR, i.e. "$/€"
      Ratio : constant Wide_String := "$/" & Wide_Character'Val (16#20AC#);
      use Ada.Wide_Text_IO, Ada.Strings;
   begin
      Put_Line (Ratio);   --  use defaults, traditional
      String'Write        --  stream output, force UTF-8
        (Text_Streams.Stream (Current_Output),
         UTF_Encoding.Wide_Strings.Encode (Ratio));
   end UTF;
The above source text is itself plain 7-bit: the post-Latin-1 character
in the string is written as Wide_Character'Val (16#20AC#), and only the
comment contains a wide character directly.
If, instead, the source text uses a wider encoding, with post-Latin-1
literals or identifiers, then the compiler may need to be told.
I think that BOMs may be of use, and in any case, there are
compiler switches or some other vendor-specific vocabulary
describing source text.
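As one illustration of such vendor-specific vocabulary, and assuming GNAT is the compiler (other compilers will differ), a UTF-8 encoded source file could carry the literal directly. The pragma below is GNAT-specific; compiling with the -gnatW8 switch is the command-line alternative. The unit name UTF_Literal is made up for the example.

```ada
--  GNAT-specific: declare that wide characters in this source file are
--  UTF-8 encoded (roughly what the -gnatW8 switch does on the command line).
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Text_IO;

procedure UTF_Literal is
   --  The euro sign appears literally; the file must be saved as UTF-8.
   Ratio : constant Wide_String := "$/€";
begin
   Ada.Wide_Text_IO.Put_Line (Ratio);
end UTF_Literal;
```

How the output is encoded on the terminal is a separate matter and depends on the implementation's Wide_Text_IO defaults.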
* Re: Convert wide_string to string (as the same byte array)
2012-03-06 15:54 0% ` Adam Beneschan
@ 2012-03-07 1:04 0% ` Randy Brukardt
From: Randy Brukardt @ 2012-03-07 1:04 UTC
"Adam Beneschan" <adam@irvine.com> wrote in message
news:5368448.8.1331049289886.JavaMail.geo-discussion-forums@pbbpr1...
> On Monday, March 5, 2012 5:58:48 PM UTC-8, Randy Brukardt wrote:
>>
>> An alternative to Adam's solution would be to use the Ada2012 encoding
>> functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings,
>> and
>> use a UTF-8 encoding. That would be shorter, but not fixed length, so
>> whether that would work for you depends on the API you are feeding these
>> into.
>
> This may seem like a dumb question, but does that preserve order?
My understanding was that UTF-8 was designed so that ordinary byte
comparison operations would work "properly" on UTF-8 strings (presuming no
"overlong encodings" are used; there is no point in such things, it's like
including NOPs in your generated instructions). That's surely true if only
equality is involved; I believe it is also true for ordering, but as I've
never tried it I don't want to say for absolutely certain.
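A quick way to try this, as a minimal sketch only: the two sample values below are arbitrary picks (U+00E9 encodes to two octets, U+20AC to three), and one pair agreeing is of course a spot check, not a proof. Order_Check is a made-up name.

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO;

procedure Order_Check is
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical sample values chosen to have different encoded lengths.
   A : constant Wide_String := (1 => Wide_Character'Val (16#00E9#));
   B : constant Wide_String := (1 => Wide_Character'Val (16#20AC#));

   --  Typed constants select the UTF-8 overload of Encode unambiguously
   --  (a bare Encode (A) < Encode (B) could also mean the UTF-16 one).
   EA : constant UTF_8_String := Wide_Strings.Encode (A);
   EB : constant UTF_8_String := Wide_Strings.Encode (B);
begin
   --  Predefined "<" on String and Wide_String is lexicographic, so this
   --  compares code-point order against octet order.
   Ada.Text_IO.Put_Line
     ("Same ordering: " & Boolean'Image ((A < B) = (EA < EB)));
end Order_Check;
```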
Randy.
* Re: Convert wide_string to string (as the same byte array)
2012-03-06 1:58 5% ` Randy Brukardt
@ 2012-03-06 15:54 0% ` Adam Beneschan
2012-03-07 1:04 0% ` Randy Brukardt
From: Adam Beneschan @ 2012-03-06 15:54 UTC
On Monday, March 5, 2012 5:58:48 PM UTC-8, Randy Brukardt wrote:
>
> An alternative to Adam's solution would be to use the Ada2012 encoding
> functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings, and
> use a UTF-8 encoding. That would be shorter, but not fixed length, so
> whether that would work for you depends on the API you are feeding these
> into.
This may seem like a dumb question, but does that preserve order?
-- Adam
* Re: Convert wide_string to string (as the same byte array)
@ 2012-03-06 1:58 5% ` Randy Brukardt
2012-03-06 15:54 0% ` Adam Beneschan
From: Randy Brukardt @ 2012-03-06 1:58 UTC
"Erich" <john@peppermind.com> wrote in message
news:f88cc8ca-183a-40c7-a01c-2adc1137d845@b18g2000vbz.googlegroups.com...
>A newbie question: I need to convert a wide_string to a (platform/
> endian independent) string that represents all the bytes of the
> wide_string. How do you do that?
An alternative to Adam's solution would be to use the Ada2012 encoding
functions (A.4.11), specifically Ada.Strings.UTF_Encoding.Wide_Strings, and
use a UTF-8 encoding. That would be shorter, but not fixed length, so
whether that would work for you depends on the API you are feeding these
into.
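Something along these lines, as a rough sketch (the sample string and the name Wide_To_UTF8 are made up for illustration):

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO;

procedure Wide_To_UTF8 is
   package UTF renames Ada.Strings.UTF_Encoding;

   --  Hypothetical sample: "$/" followed by the euro sign (U+20AC).
   Source : constant Wide_String := "$/" & Wide_Character'Val (16#20AC#);

   --  Endian-independent octet sequence; no BOM is emitted by default.
   Bytes : constant UTF.UTF_8_String := UTF.Wide_Strings.Encode (Source);

   --  Decoding recovers the original Wide_String.
   Back : constant Wide_String := UTF.Wide_Strings.Decode (Bytes);
begin
   Ada.Text_IO.Put_Line ("Wide characters:" & Integer'Image (Source'Length));
   Ada.Text_IO.Put_Line ("UTF-8 octets:" & Integer'Image (Bytes'Length));
   Ada.Text_IO.Put_Line ("Round trip OK: " & Boolean'Image (Back = Source));
end Wide_To_UTF8;
```

Here the three wide characters become five octets (one each for the two ASCII characters, three for the euro sign), against six for a fixed two-octets-per-character layout, which is the "shorter, but not fixed length" trade-off.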
Randy.
2012-02-24 22:01 Convert wide_string to string (as the same byte array) Erich
2012-03-06 1:58 5% ` Randy Brukardt
2012-03-06 15:54 0% ` Adam Beneschan
2012-03-07 1:04 0% ` Randy Brukardt
2017-12-27 18:08 unicode and wide_text_io Mehdi Saada
2017-12-28 13:15 ` Mehdi Saada
2017-12-28 14:25 ` Dmitry A. Kazakov
2017-12-28 14:32 ` Simon Wright
2017-12-28 15:28 ` Niklas Holsti
2017-12-28 15:47 ` 00120260b
2017-12-28 22:35 7% ` G.B.
2018-10-31 2:57 windows-1251 to utf-8 eduardsapotski
2018-10-31 15:28 ` eduardsapotski
2018-10-31 17:01 ` Dmitry A. Kazakov
2018-10-31 20:58 5% ` Randy Brukardt