From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: windows-1251 to utf-8
Date: Wed, 31 Oct 2018 15:58:21 -0500
Message-ID: <prd51e$sp3$1@franka.jacob-sparre.dk>
In-Reply-To: prcn4v$d30$1@gioia.aioe.org
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier. For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client; use AWS.Client;
>> with AWS.Messages; use AWS.Messages;
>> with AWS.Response; use AWS.Response;
>>
>> procedure Main is
>>
>> HTML_Result : Unbounded_String;
>> Request_Header_List : Header_List;
>>
>> begin
>>
>> Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 " &
>> "(X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>>
>> HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers
>> => Request_Header_List));
>>
>> Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My Linux terminal (default UTF-8) shows:
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If I set the terminal encoding to Windows-1251, all is well:
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard
> requires Character be Latin-1. The only case when your program would be
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the
> charset key into a string containing UTF-8 octets and use
> Streams.Stream_IO to write these octets as-is. The conversion for the case
> of windows-1251 I described earlier. Create a table Character'Pos
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
> underlying OS.
>
> P.P.S. Technically AWS also ignores the Ada standard, but that is an
> established practice, since there is no better way.
Right. Probably the easiest way to do this (using just Ada functions) would
be to:
(A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String
of the unbounded string to a Wide_String, and then store that in an
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
(B) Use Ada.Strings.Wide_Maps.To_Mapping to create a character conversion
map (the conversions were described by another reply);
(C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B)
to your Unbounded_Wide_String;
(D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably
storing the result into an Unbounded_String.
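
Steps (A) through (D) might be sketched roughly as follows. This is only an illustration, not tested against AWS: the procedure name Recode_1251 and the helper Cyrillic_Map are made up for the example, the constant Raw stands in for To_String (HTML_Result), and the map covers only the Cyrillic letters (16#C0# .. 16#FF# plus the two forms of "io"). The remaining windows-1251 codes in 16#80# .. 16#BF# would need their own entries. The result is written through a text stream, along the lines Dmitry suggested, so the octets go out as-is.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.Unbounded;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Unbounded;
with Ada.Text_IO.Text_Streams;

procedure Recode_1251 is
   use Ada.Strings.Wide_Unbounded;

   --  Hypothetical helper for step (B): maps only the Cyrillic letters.
   --  Other windows-1251 codes in 16#80# .. 16#BF# are NOT handled here.
   function Cyrillic_Map return Ada.Strings.Wide_Maps.Wide_Character_Mapping is
      From : Wide_String (1 .. 66);
      To   : Wide_String (1 .. 66);
   begin
      --  16#C0# .. 16#FF# correspond to U+0410 .. U+044F
      for I in 0 .. 63 loop
         From (I + 1) := Wide_Character'Val (16#C0# + I);
         To (I + 1)   := Wide_Character'Val (16#0410# + I);
      end loop;
      --  Capital and small "io"
      From (65) := Wide_Character'Val (16#A8#);
      To (65)   := Wide_Character'Val (16#0401#);
      From (66) := Wide_Character'Val (16#B8#);
      To (66)   := Wide_Character'Val (16#0451#);
      return Ada.Strings.Wide_Maps.To_Mapping (From, To);
   end Cyrillic_Map;

   --  Stand-in for To_String (HTML_Result): three windows-1251 octets
   Raw : constant String :=
     Character'Val (16#C0#) & Character'Val (16#E4#) & Character'Val (16#E0#);

   --  (A) Widen each octet to a Wide_Character with the same position
   Wide : constant Unbounded_Wide_String :=
     To_Unbounded_Wide_String
       (Ada.Characters.Conversions.To_Wide_String (Raw));

   --  (C) Apply the mapping from (B)
   Translated : constant Unbounded_Wide_String :=
     Translate (Wide, Cyrillic_Map);

   --  (D) Encode as UTF-8 and keep the result in an Unbounded_String
   UTF8 : constant Ada.Strings.Unbounded.Unbounded_String :=
     Ada.Strings.Unbounded.To_Unbounded_String
       (Ada.Strings.UTF_Encoding.Wide_Strings.Encode
          (To_Wide_String (Translated)));
begin
   --  Write the UTF-8 octets unchanged to standard output
   String'Write
     (Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output),
      Ada.Strings.Unbounded.To_String (UTF8));
   Ada.Text_IO.New_Line;
end Recode_1251;
```

To_Mapping raises Translation_Error if the From sequence contains duplicates or the two sequences differ in length, so a complete table needs exactly one entry per source code.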
You potentially could skip (D) if Wide_Text_IO works when sent to
Standard_Output (I'd expect that on Windows, no idea on Linux). In that
case, use Wide_Text_IO.Put to send your result.
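
A minimal sketch of that alternative, with the procedure name Show_Wide and the sample text invented for illustration. How the wide characters are actually encoded on output is implementation-defined. With GNAT, I believe it depends on the wide-character encoding selected for the main program (for instance the -gnatW8 switch for UTF-8), so treat that part as an assumption to check.

```ada
with Ada.Wide_Text_IO;

procedure Show_Wide is
   --  Three Cyrillic code points, standing in for the translated text
   Text : constant Wide_String :=
     (Wide_Character'Val (16#0410#),
      Wide_Character'Val (16#0434#),
      Wide_Character'Val (16#0430#));
begin
   --  The output encoding is chosen by the implementation, not by this call
   Ada.Wide_Text_IO.Put_Line (Text);
end Show_Wide;
```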
In any case, this shows why Unicode exists, and why anything these days that
uses non-standard encodings is evil. There's really no short-cut to recoding
such things, and that makes them maddening.
Randy.