* windows-1251 to utf-8
From: eduardsapotski @ 2018-10-31 2:57 UTC (permalink / raw)
I get HTML from a web server in windows-1251 encoding.
How do I convert HTML from windows-1251 to UTF-8?
Thanks.
* Re: windows-1251 to utf-8
From: gautier_niouzes @ 2018-10-31 6:09 UTC (permalink / raw)
Have a look here:
https://sf.net/p/wasabee/code/HEAD/tree/zrt_dev/common/wasabee-encoding.adb
HTH
G.
* Re: windows-1251 to utf-8
From: Dmitry A. Kazakov @ 2018-10-31 10:01 UTC (permalink / raw)
On 2018-10-31 03:57, eduardsapotski@gmail.com wrote:
> I get HTML from web-server in windows-1251 encoding.
> How do convert HTML in windows-1251 to utf-8?
The encoding table is this:
https://en.wikipedia.org/wiki/Windows-1251
The 7-bit codes correspond to UTF-8 directly. For 8-bit codes (for all
codes, actually) you take the code point from the table, e.g. Cyrillic
capital Ц -> 16#0426#, and convert it to a UTF-8 sequence using, for
example, this:
http://www.dmitry-kazakov.de/ada/strings_edit.htm#7
The function Strings_Edit.UTF8.Image takes a code point and returns its
UTF-8 equivalent, so
Strings_Edit.UTF8.Image (16#0426#)
gives Ц in UTF-8.
HTML is an unrelated story. Do you mean RFC 2396 escape sequences? This
is an alternative representation that has nothing to do with Windows-1251.
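The code-point-to-UTF-8 step can also be tried with the standard library alone. What follows is a minimal sketch, assuming an Ada 2012 compiler. It uses Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode in place of Strings_Edit.UTF8.Image, and the procedure name Code_Point_Demo is made up for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Code_Point_Demo is
   use Ada.Strings.UTF_Encoding;

   --  U+0426, CYRILLIC CAPITAL LETTER TSE (the table entry for Ц)
   Tse : constant Wide_Wide_Character := Wide_Wide_Character'Val (16#0426#);

   --  Encode the single code point as a UTF-8 octet sequence
   UTF8 : constant UTF_8_String :=
     Wide_Wide_Strings.Encode (Wide_Wide_String'(1 => Tse));
begin
   --  Print each octet as a decimal number: 208 and 166 (16#D0#, 16#A6#)
   for C of UTF8 loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
   end loop;
   Ada.Text_IO.New_Line;
end Code_Point_Demo;
```

Printing the octet values rather than the string itself keeps the check independent of the terminal encoding.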
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: windows-1251 to utf-8
From: eduardsapotski @ 2018-10-31 15:28 UTC (permalink / raw)
Let's make it easier. For example:
------------------------------------------------------------------
with Ada.Strings.Unbounded;    use Ada.Strings.Unbounded;
with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
with AWS.Client;               use AWS.Client;
with AWS.Messages;             use AWS.Messages;
with AWS.Response;             use AWS.Response;

procedure Main is
   HTML_Result         : Unbounded_String;
   Request_Header_List : Header_List;
begin
   Request_Header_List.Add
     (Name  => "User-Agent",
      Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
   HTML_Result :=
     Message_Body (Get (URL => "http://www.sql.ru/", Headers => Request_Header_List));
   Put_Line (HTML_Result);
end Main;
------------------------------------------------------------------
My Linux terminal (UTF-8 by default) shows: https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
If I set the terminal encoding to Windows-1251, all is well: https://photos.app.goo.gl/goN5g7uofD8rYLP79
Is there a standard way to solve this problem?
* Re: windows-1251 to utf-8
From: Shark8 @ 2018-10-31 16:50 UTC (permalink / raw)
> Are there standard ways to solve this problem?
I *think* you can use Character-mapping to translate from Windows-1251 to UTF-X... although I'm unsure if it has to be the same character-size.
Failing that, maybe Matreshka -- http://forge.ada-ru.org/matreshka -- has something for it. I haven't used Matreshka [yet] but there's supposedly a big Unicode/manipulation library in it.
* Re: windows-1251 to utf-8
From: Dmitry A. Kazakov @ 2018-10-31 17:01 UTC (permalink / raw)
On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
> Let's make it easier. For example:
>
> ------------------------------------------------------------------
>
> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>
> with AWS.Client; use AWS.Client;
> with AWS.Messages; use AWS.Messages;
> with AWS.Response; use AWS.Response;
>
> procedure Main is
>
> HTML_Result : Unbounded_String;
> Request_Header_List : Header_List;
>
> begin
>
> Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>
> HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers => Request_Header_List));
>
> Put_Line(HTML_Result);
>
> end Main;
>
> ------------------------------------------------------------------
>
> My linux terminal (default UTF-8) show: https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>
> If set encoding in terminal Windows-1251 - all is well: https://photos.app.goo.gl/goN5g7uofD8rYLP79
>
> Are there standard ways to solve this problem?
What problem? The page declares charset=windows-1251 for its content. That is legal.
Your program is incorrect because it prints the body using Put_Line. The Ada
standard requires Character to be Latin-1, so the program would only be
correct if the page used charset=ISO-8859-1.
You must convert the page body, according to the encoding specified by
the charset key, into a string containing UTF-8 octets and use
Streams.Stream_IO to write these octets as-is. The conversion for the
case of windows-1251 I described earlier: create a table Character'Pos
0..255 -> Code_Point and apply it to each "character" of HTML_Result.
P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
underlying OS.
P.P.S. Technically AWS also ignores the Ada standard, but that is
established practice, since there is no better way.
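The table approach described above might look roughly like the sketch below. It is deliberately partial and rests on stated assumptions: only the ASCII range, the main Cyrillic block 16#C0#..16#FF# (U+0410..U+044F), and Ё/ё are mapped, and every other Windows-1251 position is replaced by U+FFFD. A real table would list all 256 entries from the Wikipedia page. The names CP1251_Demo, To_Code_Point and To_UTF_8 are hypothetical, and the octets go to standard output through Ada.Text_IO.Text_Streams rather than a Stream_IO file.

```ada
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure CP1251_Demo is
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper: partial Windows-1251 -> code point mapping.
   --  Positions 16#80#..16#BF# other than Ё/ё are NOT handled here.
   function To_Code_Point (C : Character) return Wide_Wide_Character is
      Pos : constant Natural := Character'Pos (C);
   begin
      if Pos < 16#80# then
         return Wide_Wide_Character'Val (Pos);                      --  ASCII
      elsif Pos >= 16#C0# then
         return Wide_Wide_Character'Val (Pos - 16#C0# + 16#0410#);  --  А..я
      elsif Pos = 16#A8# then
         return Wide_Wide_Character'Val (16#0401#);                 --  Ё
      elsif Pos = 16#B8# then
         return Wide_Wide_Character'Val (16#0451#);                 --  ё
      else
         return Wide_Wide_Character'Val (16#FFFD#);  --  placeholder only
      end if;
   end To_Code_Point;

   --  Map every octet to a code point, then encode the result as UTF-8
   function To_UTF_8 (Source : String) return UTF_8_String is
      Decoded : Wide_Wide_String (Source'Range);
   begin
      for I in Source'Range loop
         Decoded (I) := To_Code_Point (Source (I));
      end loop;
      return Wide_Wide_Strings.Encode (Decoded);
   end To_UTF_8;

   --  "Привет" as Windows-1251 octets
   Sample : constant String :=
     Character'Val (16#CF#) & Character'Val (16#F0#) & Character'Val (16#E8#)
     & Character'Val (16#E2#) & Character'Val (16#E5#) & Character'Val (16#F2#);
begin
   --  Write the UTF-8 octets as-is, bypassing Text_IO character handling
   String'Write
     (Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output),
      To_UTF_8 (Sample) & ASCII.LF);
end CP1251_Demo;
```

In the original program, To_UTF_8 (To_String (HTML_Result)) would take the place of the sample string.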
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: windows-1251 to utf-8
From: Randy Brukardt @ 2018-10-31 20:58 UTC (permalink / raw)
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier. For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client; use AWS.Client;
>> with AWS.Messages; use AWS.Messages;
>> with AWS.Response; use AWS.Response;
>>
>> procedure Main is
>>
>> HTML_Result : Unbounded_String;
>> Request_Header_List : Header_List;
>>
>> begin
>>
>> Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0
>> (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>>
>> HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers
>> => Request_Header_List));
>>
>> Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My linux terminal (default UTF-8) show:
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If set encoding in terminal Windows-1251 - all is well:
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard
> requires Character be Latin-1. The only case when your program would be
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the
> charset key into a string containing UTF-8 octets and use
> Streams.Stream_IO to write these octets as-is. The conversion for the case
> of windows-1251 I described earlier. Create a table Character'Pos
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
> underlying OS.
>
> P.P.S. Technically AWS also ignores Ada standard. But that is an
> established practice. Since there is no better way.
Right. Probably the easiest way to do this (using just Ada functions) would
be to:
(A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String of
the unbounded string to a Wide_String, and then store that in an
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
(B) Use Ada.Strings.Wide_Maps to create a character conversion map (the
conversions were described in another reply);
(C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B)
to your Unbounded_Wide_String;
(D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably
storing the result into an Unbounded_String.
You could potentially skip (D) if Wide_Text_IO works when sent to
Standard_Output (I'd expect that on Windows, no idea on Linux). In that
case, use Wide_Text_IO.Put to send your result.
In any case, this shows why Unicode exists, and why anything these days that
uses non-standard encodings is evil. There's really no short-cut to recoding
such things, and that makes them maddening.
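Steps (A) to (D) can be sketched with standard packages only. This is an illustration under stated assumptions, not a complete converter: the mapping covers just 16#C0#..16#FF# and Ё/ё, so the remaining Windows-1251 positions pass through as their Latin-1 code points, which is wrong for real data. Translate_Demo and CP1251_Mapping are made-up names.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Unbounded;
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;

procedure Translate_Demo is
   use Ada.Strings.Wide_Unbounded;

   --  (B) Partial mapping: 16#C0#..16#FF# -> U+0410..U+044F, plus Ё and ё
   function CP1251_Mapping
     return Ada.Strings.Wide_Maps.Wide_Character_Mapping
   is
      Source_Chars : Wide_String (1 .. 66);
      Target_Chars : Wide_String (1 .. 66);
   begin
      for I in 0 .. 63 loop
         Source_Chars (I + 1) := Wide_Character'Val (16#C0# + I);
         Target_Chars (I + 1) := Wide_Character'Val (16#0410# + I);
      end loop;
      Source_Chars (65) := Wide_Character'Val (16#A8#);
      Target_Chars (65) := Wide_Character'Val (16#0401#);
      Source_Chars (66) := Wide_Character'Val (16#B8#);
      Target_Chars (66) := Wide_Character'Val (16#0451#);
      return Ada.Strings.Wide_Maps.To_Mapping (Source_Chars, Target_Chars);
   end CP1251_Mapping;

   --  Ц as a single Windows-1251 octet
   Raw : constant String := (1 => Character'Val (16#D6#));

   --  (A) Widen the octets and store them in an Unbounded_Wide_String
   Wide : constant Unbounded_Wide_String :=
     To_Unbounded_Wide_String
       (Ada.Characters.Conversions.To_Wide_String (Raw));

   --  (C) Apply the mapping
   Translated : constant Unbounded_Wide_String :=
     Translate (Wide, CP1251_Mapping);

   --  (D) Encode the translated text as UTF-8
   UTF8 : constant String :=
     Ada.Strings.UTF_Encoding.Wide_Strings.Encode
       (To_Wide_String (Translated));
begin
   --  Write the octets unchanged to standard output
   String'Write
     (Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output),
      UTF8 & ASCII.LF);
end Translate_Demo;
```

Wide_Character is sufficient here because every Windows-1251 character lies in the Basic Multilingual Plane.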
Randy.
* Re: windows-1251 to utf-8
From: Björn Lundin @ 2018-11-01 12:49 UTC (permalink / raw)
On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
> Let's make it easier. For example:
>
> ------------------------------------------------------------------
>
> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>
> with AWS.Client; use AWS.Client;
> with AWS.Messages; use AWS.Messages;
> with AWS.Response; use AWS.Response;
>
> procedure Main is
>
> HTML_Result : Unbounded_String;
> Request_Header_List : Header_List;
>
> begin
>
> Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>
> HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers => Request_Header_List));
>
> Put_Line(HTML_Result);
>
> end Main;
>
> ------------------------------------------------------------------
>
> My linux terminal (default UTF-8) show: https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>
> If set encoding in terminal Windows-1251 - all is well: https://photos.app.goo.gl/goN5g7uofD8rYLP79
>
> Are there standard ways to solve this problem?
>
In XML/Ada there are Unicode packages.
Something like this (with changes for 1251 instead of Latin_1 still to be done):

with Unicode.Ces.Utf8, Unicode.Ces.Utf32, Unicode.Ces.Basic_8bit,
     Unicode.Ccs.ISO_8859_1;
use Unicode, Unicode.Ccs, Unicode.Ces, Unicode.Ces.Utf8, Unicode.Ces.Utf32;
--  some withs are likely not needed, code copied from a bigger function

function To_Utf_8_From_Latin_1_Little_Endian
  (A_Latin_1_Encoded_String : in String) return String
is
   --  32-bit Latin-1 string (normal Ada string with 32-bit characters)
   S_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
     Unicode.Ces.Basic_8bit.To_Utf32 (A_Latin_1_Encoded_String);
   --  UTF-32 string (convert Latin-1 to Unicode characters)
   U_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
     Unicode.Ces.Utf32.To_Unicode_Le
       (S_32,
        Cs => Unicode.Ccs.ISO_8859_1.ISO_8859_1_Character_Set);
   --  change UTF-32 to UTF-8
   An_Utf_8_Encoded_String_Le : Unicode.Ces.Utf8.Utf8_String :=
     Unicode.Ces.Utf8.From_Utf32 (U_32);
begin
   return An_Utf_8_Encoded_String_Le;
end To_Utf_8_From_Latin_1_Little_Endian;
---------------------------------------------------------------------------------
It's a starting point
--
--
Björn
* Re: windows-1251 to utf-8
From: Dmitry A. Kazakov @ 2018-11-01 13:26 UTC (permalink / raw)
On 2018-11-01 13:49, Björn Lundin wrote:
> something like (with changes for 1251 instead of Latin_1 to be done)
You probably mean 1252, which is almost Latin-1. 1251 is totally different:
it has Cyrillic letters in the upper half of the 8-bit codes, in the place
where 1252 keeps Western European letters with diacritical marks.
Maybe I will add 1251 and 1252 in the next release of the Strings editing
library.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: windows-1251 to utf-8
From: Björn Lundin @ 2018-11-01 14:34 UTC (permalink / raw)
On 2018-11-01 14:26, Dmitry A. Kazakov wrote:
> On 2018-11-01 13:49, Björn Lundin wrote:
>
>> something like (with changes for 1251 instead of Latin_1 to be done)
>
> You probably mean 1252 which almost Latin-1.
I do.
> 1251 is totally different.
> it has Cyrillic letters in the upper half of 8-bit codes, in the place
> where 1252 keeps Central European letters with fancy diacritic marks.
I also found that the code in my last post can be replaced by a call to
Unicode.Encodings.Convert, here for the UTF-8 to ISO-8859-15 direction:
-------------------------------------------------------
function To_Iso_Latin_15 (Str : Unicode.CES.Byte_Sequence) return String is
   use Unicode.Encodings;
begin
   return Convert (Str  => Str,
                   From => Get_By_Name ("utf-8"),
                   To   => Get_By_Name ("iso-8859-15"));
end To_Iso_Latin_15;
-------------------------------------------------------
I also see that the Unicode package in XML/Ada has support for
1251 and 1252:
package Unicode.CCS.Windows_1251 is ...
The withs are
with Ada.Exceptions;                    use Ada.Exceptions;
with Unicode.Names.Cyrillic;            use Unicode.Names.Cyrillic;
with Unicode.Names.Basic_Latin;         use Unicode.Names.Basic_Latin;
with Unicode.Names.Latin_1_Supplement;  use Unicode.Names.Latin_1_Supplement;
with Unicode.Names.Currency_Symbols;    use Unicode.Names.Currency_Symbols;
with Unicode.Names.General_Punctuation; use Unicode.Names.General_Punctuation;
with Unicode.Names.Letterlike_Symbols;  use Unicode.Names.Letterlike_Symbols;
which suggests to me that it is the Cyrillic one,
which (I think) would make the function above
-------------------------------------------------------
function To_Windows_1251 (Str : Unicode.CES.Byte_Sequence) return String is
   use Unicode.Encodings;
begin
   return Convert (Str  => Str,
                   From => Get_By_Name ("utf-8"),
                   To   => Get_By_Name ("Windows-1251"));
end To_Windows_1251;
-------------------------------------------------------
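Note that To_Windows_1251 converts from UTF-8 to Windows-1251, while the original question asks for the opposite direction. Swapping the two encodings would presumably give the sketch below. It is untested and assumes that Convert and Get_By_Name behave as in the function above and that Unicode.CES.Byte_Sequence is a plain String subtype. The name From_Windows_1251 is made up.

```ada
with Unicode.CES;
with Unicode.Encodings;

--  Hypothetical counterpart of To_Windows_1251: takes Windows-1251 octets
--  and returns the same text as a UTF-8 byte sequence.
function From_Windows_1251
  (Str : String) return Unicode.CES.Byte_Sequence
is
   use Unicode.Encodings;
begin
   return Convert (Str  => Str,
                   From => Get_By_Name ("Windows-1251"),
                   To   => Get_By_Name ("utf-8"));
end From_Windows_1251;
```

The result would still have to be written as raw octets rather than through Put_Line, for the reasons given earlier in the thread.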
--
--
Björn
* Re: windows-1251 to utf-8
From: Vadim Godunko @ 2018-11-01 18:14 UTC (permalink / raw)
You can use Matreshka's text codecs; here is an example.

with Ada.Text_IO;        use Ada.Text_IO;
with AWS.Client;         use AWS.Client;
with AWS.Response;       use AWS.Response;
with League.Strings;     use League.Strings;
with League.Text_Codecs; use League.Text_Codecs;

procedure Main is
   Request_Header_List : Header_List;
   CP1251_Codec        : Text_Codec := Codec (To_Universal_String ("cp1251"));
   Text                : Universal_String;
begin
   Request_Header_List.Add
     (Name  => "User-Agent",
      Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
   --  Decode the Windows-1251 body into a Universal_String
   Text := CP1251_Codec.Decode
     (Message_Body (Get (URL => "http://www.sql.ru/", Headers => Request_Header_List)));
   --  Print the text re-encoded as UTF-8
   Put_Line (Text.To_UTF_8_String);
end Main;