comp.lang.ada
* windows-1251 to utf-8
@ 2018-10-31  2:57 eduardsapotski
  2018-10-31  6:09 ` gautier_niouzes
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: eduardsapotski @ 2018-10-31  2:57 UTC (permalink / raw)


I get HTML from a web server in windows-1251 encoding.
How do I convert HTML from windows-1251 to UTF-8?
Thanks.

* Re: windows-1251 to utf-8
  2018-10-31  2:57 windows-1251 to utf-8 eduardsapotski
@ 2018-10-31  6:09 ` gautier_niouzes
  2018-10-31 10:01 ` Dmitry A. Kazakov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: gautier_niouzes @ 2018-10-31  6:09 UTC (permalink / raw)


Have a look here:

https://sf.net/p/wasabee/code/HEAD/tree/zrt_dev/common/wasabee-encoding.adb

HTH
G.


* Re: windows-1251 to utf-8
  2018-10-31  2:57 windows-1251 to utf-8 eduardsapotski
  2018-10-31  6:09 ` gautier_niouzes
@ 2018-10-31 10:01 ` Dmitry A. Kazakov
  2018-10-31 15:28 ` eduardsapotski
  2018-11-01 18:14 ` Vadim Godunko
  3 siblings, 0 replies; 11+ messages in thread
From: Dmitry A. Kazakov @ 2018-10-31 10:01 UTC (permalink / raw)


On 2018-10-31 03:57, eduardsapotski@gmail.com wrote:
> I get HTML from a web server in windows-1251 encoding.
> How do I convert HTML from windows-1251 to UTF-8?

The encoding table is this:

    https://en.wikipedia.org/wiki/Windows-1251

The 7-bit codes correspond to UTF-8 directly. For 8-bit codes (for all 
codes actually) you take the code point from the table, e.g. Cyrillic 
capital Ц -> 16#0426#, and convert it to a UTF-8 sequence using, for 
example, this:

    http://www.dmitry-kazakov.de/ada/strings_edit.htm#7

The function Strings_Edit.UTF8.Image takes a code point and returns its 
UTF-8 equivalent, so

    Strings_Edit.UTF8.Image (16#0426#)

gives Ц in UTF-8.
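
As a rough illustration of what such a conversion does internally, here is a minimal hand-written sketch. It is not the Strings_Edit source; the names Code_Point, To_UTF_8 and Show_UTF_8 are made up for this example, and surrogate code points are not rejected. It prints the decimal values of the octets produced for 16#0426#, which are 16#D0# and 16#A6#.

```ada
with Ada.Text_IO;

procedure Show_UTF_8 is

   --  Hypothetical type for this sketch: any Unicode code point.
   type Code_Point is range 0 .. 16#10FFFF#;

   --  Encode one code point as 1 to 4 UTF-8 octets held in a String.
   function To_UTF_8 (Code : Code_Point) return String is
      V : constant Natural := Natural (Code);
   begin
      if V < 16#80# then
         --  7-bit codes are stored unchanged.
         return (1 => Character'Val (V));
      elsif V < 16#800# then
         --  110xxxxx 10xxxxxx
         return (Character'Val (16#C0# + V / 64),
                 Character'Val (16#80# + V mod 64));
      elsif V < 16#1_0000# then
         --  1110xxxx 10xxxxxx 10xxxxxx
         return (Character'Val (16#E0# + V / 4096),
                 Character'Val (16#80# + (V / 64) mod 64),
                 Character'Val (16#80# + V mod 64));
      else
         --  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         return (Character'Val (16#F0# + V / 262144),
                 Character'Val (16#80# + (V / 4096) mod 64),
                 Character'Val (16#80# + (V / 64) mod 64),
                 Character'Val (16#80# + V mod 64));
      end if;
   end To_UTF_8;

begin
   --  Cyrillic capital Ц is U+0426.
   for C of To_UTF_8 (16#0426#) loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
   end loop;
   Ada.Text_IO.New_Line;
end Show_UTF_8;
```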

HTML is an unrelated story. Do you mean RFC 2396 escape sequences? This 
is an alternative representation that has nothing to do with Windows-1251.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: windows-1251 to utf-8
  2018-10-31  2:57 windows-1251 to utf-8 eduardsapotski
  2018-10-31  6:09 ` gautier_niouzes
  2018-10-31 10:01 ` Dmitry A. Kazakov
@ 2018-10-31 15:28 ` eduardsapotski
  2018-10-31 16:50   ` Shark8
                     ` (2 more replies)
  2018-11-01 18:14 ` Vadim Godunko
  3 siblings, 3 replies; 11+ messages in thread
From: eduardsapotski @ 2018-10-31 15:28 UTC (permalink / raw)


Let's make it easier. For example:

------------------------------------------------------------------

with Ada.Strings.Unbounded;     use Ada.Strings.Unbounded;
with Ada.Text_IO.Unbounded_IO;  use Ada.Text_IO.Unbounded_IO;

with AWS.Client;            use AWS.Client;
with AWS.Messages;          use AWS.Messages;
with AWS.Response;          use AWS.Response;

procedure Main is

   HTML_Result   : Unbounded_String;
   Request_Header_List : Header_List;

begin

   Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");

   HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers => Request_Header_List));

   Put_Line(HTML_Result);

end Main;

------------------------------------------------------------------

My Linux terminal (UTF-8 by default) shows: https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA

If I set the terminal encoding to Windows-1251, all is well: https://photos.app.goo.gl/goN5g7uofD8rYLP79

Are there standard ways to solve this problem?

* Re: windows-1251 to utf-8
  2018-10-31 15:28 ` eduardsapotski
@ 2018-10-31 16:50   ` Shark8
  2018-10-31 17:01   ` Dmitry A. Kazakov
  2018-11-01 12:49   ` Björn Lundin
  2 siblings, 0 replies; 11+ messages in thread
From: Shark8 @ 2018-10-31 16:50 UTC (permalink / raw)


> Are there standard ways to solve this problem?

I *think* you can use Character-mapping to translate from Windows-1251 to UTF-X... although I'm unsure if it has to be the same character-size.

Failing that, maybe Matreshka -- http://forge.ada-ru.org/matreshka -- has something for it. I haven't used Matreshka [yet] but there's supposedly a big Unicode/manipulation library in it.


* Re: windows-1251 to utf-8
  2018-10-31 15:28 ` eduardsapotski
  2018-10-31 16:50   ` Shark8
@ 2018-10-31 17:01   ` Dmitry A. Kazakov
  2018-10-31 20:58     ` Randy Brukardt
  2018-11-01 12:49   ` Björn Lundin
  2 siblings, 1 reply; 11+ messages in thread
From: Dmitry A. Kazakov @ 2018-10-31 17:01 UTC (permalink / raw)


On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
> Are there standard ways to solve this problem?

What problem? The page uses the content charset=windows-1251. It is legal.

Your program is illegal as it prints the body using Put_Line. Ada 
standard requires Character be Latin-1. The only case when your program 
would be correct is when charset=ISO-8859-1.

You must convert the page body according to the encoding specified by 
the charset key into a string containing UTF-8 octets and use 
Streams.Stream_IO to write these octets as-is. The conversion for the 
case of windows-1251 I described earlier. Create a table Character'Pos 
0..255 -> Code_Point and use it for each "character" of HTML_Result.
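
A minimal sketch of this table approach follows, under stated assumptions. The table is only partly filled (ASCII plus the letters А..я, Ё and ё); the remaining slots hold U+FFFD until they are completed from the Windows-1251 table. The procedure name, the sample string and the output file name page-utf8.html are made up for illustration, and the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings does the UTF-8 encoding here instead of Strings_Edit.

```ada
with Ada.Streams.Stream_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Save_As_UTF_8 is
   use Ada.Streams.Stream_IO;

   --  Character'Pos 0 .. 255 -> code point.  Partial table (assumption):
   --  every slot not filled in below maps to U+FFFD.
   Table : array (Character) of Wide_Wide_Character :=
     (others => Wide_Wide_Character'Val (16#FFFD#));

   --  Look up every "character" of Source and encode the result as UTF-8.
   function To_UTF_8 (Source : String) return String is
      Decoded : Wide_Wide_String (Source'Range);
   begin
      for I in Source'Range loop
         Decoded (I) := Table (Source (I));
      end loop;
      return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Decoded);
   end To_UTF_8;

   --  Stand-in for the page body: Ц as the single Windows-1251 octet D6.
   Sample : constant String := (1 => Character'Val (16#D6#));
   File   : File_Type;

begin
   --  7-bit codes map to themselves.
   for C in Character'Val (0) .. Character'Val (16#7F#) loop
      Table (C) := Wide_Wide_Character'Val (Character'Pos (C));
   end loop;

   --  16#C0# .. 16#FF# are the letters U+0410 .. U+044F.
   for P in 16#C0# .. 16#FF# loop
      Table (Character'Val (P)) :=
        Wide_Wide_Character'Val (16#0410# + P - 16#C0#);
   end loop;
   Table (Character'Val (16#A8#)) := Wide_Wide_Character'Val (16#0401#);
   Table (Character'Val (16#B8#)) := Wide_Wide_Character'Val (16#0451#);

   --  Write the octets as-is, bypassing Text_IO.
   Create (File, Out_File, "page-utf8.html");
   String'Write (Stream (File), To_UTF_8 (Sample));
   Close (File);
end Save_As_UTF_8;
```

In the real program, the result of To_String (HTML_Result) would take the place of Sample.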

P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the 
underlying OS.

P.P.S. Technically AWS also ignores the Ada standard, but that is 
established practice, since there is no better way.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

* Re: windows-1251 to utf-8
  2018-10-31 17:01   ` Dmitry A. Kazakov
@ 2018-10-31 20:58     ` Randy Brukardt
  0 siblings, 0 replies; 11+ messages in thread
From: Randy Brukardt @ 2018-10-31 20:58 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard 
> requires Character be Latin-1. The only case when your program would be 
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the 
> charset key into a string containing UTF-8 octets and use 
> Streams.Stream_IO to write these octets as-is. The conversion for the case 
> of windows-1251 I described earlier. Create a table Character'Pos 
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the 
> underlying OS.
>
> P.P.S. Technically AWS also ignores the Ada standard, but that is 
> established practice, since there is no better way.

Right. Probably the easiest way to do this (using just Ada functions) would 
be to:

 (A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String 
of the unbounded string to a Wide_String, and then store that in an 
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
 (B) Use Ada.Strings.Wide_Maps to create a character conversion map (the 
conversions were described by another reply);
 (C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B) 
to your Unbounded_Wide_String;
 (D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the 
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably 
storing the result into an Unbounded_String.
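
Steps (A) to (D) can be sketched roughly as below, using only standard Ada 2012 packages. This is a sketch under assumptions, not a complete converter. For brevity it works on fixed-length strings (Ada.Strings.Wide_Fixed.Translate) rather than the unbounded variants. The mapping covers only the letters А..я, Ё and ё, so the remaining upper-half codes pass through unchanged until the map is completed. The function name CP1251_To_UTF_8 is made up for this example.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.Wide_Fixed;
with Ada.Strings.Wide_Maps;
with Ada.Strings.UTF_Encoding.Wide_Strings;

function CP1251_To_UTF_8
  (Source : String) return Ada.Strings.UTF_Encoding.UTF_8_String
is
   use Ada.Strings.Wide_Maps;

   --  Domain and range of the partial mapping (66 letters).
   From : Wide_String (1 .. 66);
   To   : Wide_String (1 .. 66);
   Map  : Wide_Character_Mapping;
begin
   --  (B) Build the map: 16#C0# .. 16#FF# -> U+0410 .. U+044F.
   for I in 0 .. 63 loop
      From (I + 1) := Wide_Character'Val (16#C0# + I);
      To (I + 1)   := Wide_Character'Val (16#0410# + I);
   end loop;
   From (65) := Wide_Character'Val (16#A8#);   --  Ё
   To (65)   := Wide_Character'Val (16#0401#);
   From (66) := Wide_Character'Val (16#B8#);   --  ё
   To (66)   := Wide_Character'Val (16#0451#);
   Map := To_Mapping (From, To);

   --  (A) widen, (C) translate, (D) encode as UTF-8.
   return Ada.Strings.UTF_Encoding.Wide_Strings.Encode
     (Ada.Strings.Wide_Fixed.Translate
        (Ada.Characters.Conversions.To_Wide_String (Source), Map));
end CP1251_To_UTF_8;
```

With this in place, CP1251_To_UTF_8 (To_String (HTML_Result)) would give the octets to write out.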

You potentially could skip (D) if Wide_Text_IO works when sent to 
Standard_Output (I'd expect that on Windows, no idea on Linux). In that 
case, use Wide_Text_IO.Put to send your result.

In any case, this shows why Unicode exists, and why anything these days that 
uses non-standard encodings is evil. There's really no short-cut to recoding 
such things, and that makes them maddening.

                                  Randy.





* Re: windows-1251 to utf-8
  2018-10-31 15:28 ` eduardsapotski
  2018-10-31 16:50   ` Shark8
  2018-10-31 17:01   ` Dmitry A. Kazakov
@ 2018-11-01 12:49   ` Björn Lundin
  2018-11-01 13:26     ` Dmitry A. Kazakov
  2 siblings, 1 reply; 11+ messages in thread
From: Björn Lundin @ 2018-11-01 12:49 UTC (permalink / raw)


On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
> Are there standard ways to solve this problem?
> 


In xml/ada there are unicode packages.

something like (with changes for 1251 instead of Latin_1 to be done)

with Unicode.Ces.Utf8, Unicode.Ces.Utf32, Unicode.Ces.Basic_8bit,
Unicode.Ccs.ISO_8859_1;
use Unicode, Unicode.Ccs, Unicode.Ces, Unicode.Ces.Utf8, Unicode.Ces.Utf32;

-- some of the withs are likely not needed; the code was copied from a bigger function


 function To_Utf_8_From_Latin_1_Little_Endian
     (A_Latin_1_Encoded_String : in String)
      return String is

    --  32-bit Latin-1 string (normal Ada string with 32-bit characters)
    S_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
       Unicode.Ces.Basic_8bit.To_Utf32 (A_Latin_1_Encoded_String);

    --  UTF-32 string (convert Latin-1 to Unicode characters)
    U_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
       Unicode.Ces.Utf32.To_Unicode_Le
          (S_32,
           Cs => Unicode.Ccs.ISO_8859_1.ISO_8859_1_Character_Set);
    -- change UTF-32 to UTF-8
    An_Utf_8_Encoded_String_Le : Unicode.Ces.Utf8.Utf8_String :=
Unicode.Ces.Utf8.From_Utf32 (U_32);

  begin
    return An_Utf_8_Encoded_String_Le;
  end To_Utf_8_From_Latin_1_Little_Endian;

---------------------------------------------------------------------------------


It's a starting point

-- 
--
Björn

* Re: windows-1251 to utf-8
  2018-11-01 12:49   ` Björn Lundin
@ 2018-11-01 13:26     ` Dmitry A. Kazakov
  2018-11-01 14:34       ` Björn Lundin
  0 siblings, 1 reply; 11+ messages in thread
From: Dmitry A. Kazakov @ 2018-11-01 13:26 UTC (permalink / raw)


On 2018-11-01 13:49, Björn Lundin wrote:

> something like (with changes for 1251 instead of Latin_1 to be done)

You probably mean 1252, which is almost Latin-1. 1251 is totally different. 
It has Cyrillic letters in the upper half of 8-bit codes, in the place 
where 1252 keeps Western European letters with fancy diacritic marks.

Maybe I will add 1251 and 1252 in the next release of Strings editing 
library.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


* Re: windows-1251 to utf-8
  2018-11-01 13:26     ` Dmitry A. Kazakov
@ 2018-11-01 14:34       ` Björn Lundin
  0 siblings, 0 replies; 11+ messages in thread
From: Björn Lundin @ 2018-11-01 14:34 UTC (permalink / raw)


On 2018-11-01 14:26, Dmitry A. Kazakov wrote:
> On 2018-11-01 13:49, Björn Lundin wrote:
> 
>> something like (with changes for 1251 instead of Latin_1 to be done)
> 
> You probably mean 1252, which is almost Latin-1.

I do.


> 1251 is totally different.
> It has Cyrillic letters in the upper half of 8-bit codes, in the place
> where 1252 keeps Western European letters with fancy diacritic marks.

And I also found that the code in my last post can be replaced by

  -------------------------------------------------------
  function To_Iso_Latin_15(Str : Unicode.CES.Byte_Sequence) return String is
    use Unicode.Encodings;
  begin
    return  Convert(Str  => Str,
                    From => Get_By_Name("utf-8"),
                    To   => Get_By_Name("iso-8859-15"));

  end To_Iso_Latin_15;
  -------------------------------------------------------

I also see that the unicode package in xml/ada has support for
1251 and 1252.

package Unicode.CCS.Windows_1251 is ...

the withs are
with Ada.Exceptions;                   use Ada.Exceptions;
with Unicode.Names.Cyrillic;           use Unicode.Names.Cyrillic;
with Unicode.Names.Basic_Latin;        use Unicode.Names.Basic_Latin;
with Unicode.Names.Latin_1_Supplement; use Unicode.Names.Latin_1_Supplement;
with Unicode.Names.Currency_Symbols;   use Unicode.Names.Currency_Symbols;
with Unicode.Names.General_Punctuation;
use Unicode.Names.General_Punctuation;
with Unicode.Names.Letterlike_Symbols;
use Unicode.Names.Letterlike_Symbols;



which suggests to me that it is the cyrillic one


which (I think) would make the function above


-------------------------------------------------------
  function To_Windows_1251(Str : Unicode.CES.Byte_Sequence) return String is
    use Unicode.Encodings;
  begin
    return  Convert(Str  => Str,
                    From => Get_By_Name("utf-8"),
                    To   => Get_By_Name("Windows-1251"));

  end To_Windows_1251;
  -------------------------------------------------------
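
Note that the function above converts from UTF-8 to Windows-1251, while the original question asks for the opposite direction. A sketch of the reverse follows. It assumes that swapping the From and To arguments is all that is needed, and that the encoding name "Windows-1251" is accepted by Get_By_Name exactly as spelled above, which has not been checked here. The function name From_Windows_1251 is made up for this example.

```ada
with Unicode.CES;
with Unicode.Encodings;

function From_Windows_1251
  (Str : String) return Unicode.CES.Byte_Sequence
is
   use Unicode.Encodings;
begin
   --  Same call as above, with the two encodings swapped.
   return Convert (Str  => Str,
                   From => Get_By_Name ("Windows-1251"),
                   To   => Get_By_Name ("utf-8"));
end From_Windows_1251;
```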



-- 
--
Björn

* Re: windows-1251 to utf-8
  2018-10-31  2:57 windows-1251 to utf-8 eduardsapotski
                   ` (2 preceding siblings ...)
  2018-10-31 15:28 ` eduardsapotski
@ 2018-11-01 18:14 ` Vadim Godunko
  3 siblings, 0 replies; 11+ messages in thread
From: Vadim Godunko @ 2018-11-01 18:14 UTC (permalink / raw)


You can use Matreshka's text codecs; here is an example.

with Ada.Text_IO;        use Ada.Text_IO;

with AWS.Client;         use AWS.Client;
with AWS.Response;       use AWS.Response;

with League.Strings;     use League.Strings;
with League.Text_Codecs; use League.Text_Codecs;

procedure Main is
   Request_Header_List : Header_List;
   CP1251_Codec        : Text_Codec := Codec (To_Universal_String ("cp1251"));
   Text                : Universal_String;

begin

   Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");

   Text := CP1251_Codec.Decode (Message_Body(Get(URL => "http://www.sql.ru/", Headers => Request_Header_List)));

   Put_Line(Text.To_UTF_8_String);

end Main;
