comp.lang.ada
From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: windows-1251 to utf-8
Date: Wed, 31 Oct 2018 15:58:21 -0500
Message-ID: <prd51e$sp3$1@franka.jacob-sparre.dk>
In-Reply-To: prcn4v$d30$1@gioia.aioe.org

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier. For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded;     use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO;  use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client;            use AWS.Client;
>> with AWS.Messages;          use AWS.Messages;
>> with AWS.Response;          use AWS.Response;
>>
>> procedure Main is
>>
>>     HTML_Result   : Unbounded_String;
>>     Request_Header_List : Header_List;
>>
>> begin
>>
>>     Request_Header_List.Add
>>       (Name  => "User-Agent",
>>        Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) "
>>                 & "Gecko/20100101 Firefox/56.0");
>>
>>     HTML_Result := Message_Body
>>       (Get (URL => "http://www.sql.ru/", Headers => Request_Header_List));
>>
>>     Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My Linux terminal (default UTF-8) shows: 
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If I set the terminal encoding to Windows-1251, all is well: 
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard 
> requires Character be Latin-1. The only case when your program would be 
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the 
> charset key into a string containing UTF-8 octets and use 
> Streams.Stream_IO to write these octets as-is. The conversion for the case 
> of windows-1251 I described earlier. Create a table Character'Pos 
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the 
> underlying OS.
>
> P.P.S. Technically AWS also ignores Ada standard. But that is an 
> established practice. Since there is no better way.

Right. Probably the easiest way to do this (using just Ada functions) would 
be to:

 (A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String 
of the unbounded string to a Wide_String, and then store that in an 
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
 (B) Use Ada.Strings.Wide_Maps to create a character conversion map (the 
conversions were described by another reply);
 (C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B) 
to your Unbounded_Wide_String;
 (D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the 
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably 
storing the result into an Unbounded_String.
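
A minimal sketch of steps (A) through (D), assuming an Ada 2012 compiler. 
The function name Win1251_To_UTF8 and the helper Build_Map are made up for 
illustration. The map covers only the letters at 16#C0# .. 16#FF# plus the 
two "io" letters at 16#A8# and 16#B8#; the rest of the 16#80# .. 16#BF# 
range is left unmapped here and would have to be added from the full 
windows-1251 table.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Unbounded;
with Ada.Strings.UTF_Encoding.Wide_Strings;

--  Hypothetical helper: takes windows-1251 octets held in a String and
--  returns the same text as UTF-8 octets.
function Win1251_To_UTF8 (Source : String) return String is
   use Ada.Strings.Wide_Maps;
   use Ada.Strings.Wide_Unbounded;

   --  (A) Widen each octet to a Wide_Character with the same position
   --  number and store the result in an Unbounded_Wide_String.
   Text : Unbounded_Wide_String :=
     To_Unbounded_Wide_String
       (Ada.Characters.Conversions.To_Wide_String (Source));

   --  (B) Build the conversion map. Assumption: only the 64 letters at
   --  16#C0# .. 16#FF# (U+0410 .. U+044F) and the two letters at 16#A8#
   --  (U+0401) and 16#B8# (U+0451) are handled; other octets in
   --  16#80# .. 16#BF# pass through unchanged, which is not correct for
   --  real windows-1251 data.
   function Build_Map return Wide_Character_Mapping is
      From : Wide_String (1 .. 66);
      To   : Wide_String (1 .. 66);
   begin
      for I in 0 .. 63 loop
         From (I + 1) := Wide_Character'Val (16#C0# + I);
         To   (I + 1) := Wide_Character'Val (16#410# + I);
      end loop;
      From (65) := Wide_Character'Val (16#A8#);
      To   (65) := Wide_Character'Val (16#401#);
      From (66) := Wide_Character'Val (16#B8#);
      To   (66) := Wide_Character'Val (16#451#);
      return To_Mapping (From, To);
   end Build_Map;

begin
   --  (C) Apply the map in place.
   Translate (Text, Build_Map);

   --  (D) Encode the translated text as UTF-8 octets (no BOM by default).
   return Ada.Strings.UTF_Encoding.Wide_Strings.Encode
     (To_Wide_String (Text));
end Win1251_To_UTF8;
```

With the program from the quoted message, this would be called as 
Win1251_To_UTF8 (To_String (HTML_Result)), and the result converted back 
with To_Unbounded_String if an Unbounded_String is wanted.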

You potentially could skip (D) if Wide_Text_IO works when sent to 
Standard_Output (I'd expect that on Windows, no idea on Linux). In that 
case, use Wide_Text_IO.Put to send your result.

In any case, this shows why Unicode exists, and why anything these days that 
uses non-standard encodings is evil. There's really no short-cut to recoding 
such things, and that makes them maddening.

                                  Randy.




