From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable autolearn_force=no version=3.4.4
Path: eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!nntp-feed.chiark.greenend.org.uk!ewrotcd!newsfeed.xs3.de!io.xs3.de!news.jacob-sparre.dk!franka.jacob-sparre.dk!pnx.dk!.POSTED.rrsoftware.com!not-for-mail
From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: windows-1251 to utf-8
Date: Wed, 31 Oct 2018 15:58:21 -0500
Organization: JSA Research & Innovation
Message-ID:
References: <74537c7a-18dd-421a-b3c2-6919285006cd@googlegroups.com>
Injection-Date: Wed, 31 Oct 2018 20:58:22 -0000 (UTC)
Injection-Info: franka.jacob-sparre.dk; posting-host="rrsoftware.com:24.196.82.226"; logging-data="29475"; mail-complaints-to="news@jacob-sparre.dk"
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Response
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
Xref: reader02.eternal-september.org comp.lang.ada:54735
Date: 2018-10-31T15:58:21-05:00
List-Id:

"Dmitry A. Kazakov" wrote in message
news:prcn4v$d30$1@gioia.aioe.org...
> On 2018-10-31 16:28, eduardsapotski@gmail.com wrote:
>> Let's make it easier.
>> For example:
>>
>> ------------------------------------------------------------------
>>
>> with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
>> with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;
>>
>> with AWS.Client; use AWS.Client;
>> with AWS.Messages; use AWS.Messages;
>> with AWS.Response; use AWS.Response;
>>
>> procedure Main is
>>
>>    HTML_Result : Unbounded_String;
>>    Request_Header_List : Header_List;
>>
>> begin
>>
>>    Request_Header_List.Add(Name => "User-Agent", Value => "Mozilla/5.0
>> (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");
>>
>>    HTML_Result := Message_Body(Get(URL => "http://www.sql.ru/", Headers
>> => Request_Header_List));
>>
>>    Put_Line(HTML_Result);
>>
>> end Main;
>>
>> ------------------------------------------------------------------
>>
>> My linux terminal (default UTF-8) show:
>> https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA
>>
>> If set encoding in terminal Windows-1251 - all is well:
>> https://photos.app.goo.gl/goN5g7uofD8rYLP79
>>
>> Are there standard ways to solve this problem?
>
> What problem? The page uses the content charset=windows-1251. It is legal.
>
> Your program is illegal as it prints the body using Put_Line. Ada standard
> requires Character be Latin-1. The only case when your program would be
> correct is when charset=ISO-8859-1.
>
> You must convert the page body according to the encoding specified by the
> charset key into a string containing UTF-8 octets and use
> Streams.Stream_IO to write these octets as-is. The conversion for the case
> of windows-1251 I described earlier. Create a table Character'Pos
> 0..255 -> Code_Point and use it for each "character" of HTML_Result.
>
> P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
> underlying OS.
>
> P.P.S. Technically AWS also ignores Ada standard. But that is an
> established practice. Since there is no better way.

Right.
Probably the easiest way to do this (using just Ada functions) would be to:

(A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String
of the unbounded string to a Wide_String, and then store that in an
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
(B) Use Ada.Strings.Wide_Maps to create a character conversion map (the
conversions were described by another reply);
(C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B)
to your Unbounded_Wide_String;
(D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably
storing the result into an Unbounded_String.

You potentially could skip (D) if Wide_Text_IO works when sent to
Standard_Output (I'd expect that on Windows, no idea on Linux). In that
case, use Wide_Text_IO.Put to send your result.

In any case, this shows why Unicode exists, and why anything these days that
uses non-standard encodings is evil. There's really no short-cut to recoding
such things, and that makes them maddening.

                                  Randy.
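Steps (A) through (D) can be sketched roughly as follows. This is only an illustration under stated assumptions, not tested against the AWS program quoted above: the procedure name CP1251_Demo, the helper CP1251_Mapping and the sample input Raw are made up for the example, and the mapping is deliberately partial (only 16#C0#..16#FF#, 16#A8# and 16#B8#); a real converter needs the full Windows-1251 table for 16#80#..16#BF#.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Strings.Wide_Maps;
with Ada.Strings.Wide_Unbounded;
with Ada.Text_IO;

procedure CP1251_Demo is
   use Ada.Strings.Wide_Unbounded;

   --  (B) Hypothetical helper: a PARTIAL Windows-1251 mapping.
   --  16#C0#..16#FF# map to U+0410..U+044F; 16#A8# and 16#B8# map to
   --  U+0401 and U+0451. Every other position is left unchanged, which
   --  is wrong for most of 16#80#..16#BF#.
   function CP1251_Mapping
     return Ada.Strings.Wide_Maps.Wide_Character_Mapping
   is
      From : Wide_String (1 .. 66);
      To   : Wide_String (1 .. 66);
   begin
      for I in 0 .. 63 loop
         From (I + 1) := Wide_Character'Val (16#C0# + I);
         To   (I + 1) := Wide_Character'Val (16#0410# + I);
      end loop;
      From (65) := Wide_Character'Val (16#A8#);
      To   (65) := Wide_Character'Val (16#0401#);
      From (66) := Wide_Character'Val (16#B8#);
      To   (66) := Wide_Character'Val (16#0451#);
      return Ada.Strings.Wide_Maps.To_Mapping (From, To);
   end CP1251_Mapping;

   --  Assumed sample input: six Windows-1251 octets held in a String,
   --  standing in for To_String (HTML_Result).
   Raw : constant String :=
     (Character'Val (16#CF#), Character'Val (16#F0#),
      Character'Val (16#E8#), Character'Val (16#E2#),
      Character'Val (16#E5#), Character'Val (16#F2#));

   --  (A) Widen each octet to the Wide_Character at the same position.
   Wide : Unbounded_Wide_String :=
     To_Unbounded_Wide_String
       (Ada.Characters.Conversions.To_Wide_String (Raw));
begin
   --  (C) Replace the widened octets with the intended code points.
   Translate (Wide, CP1251_Mapping);

   declare
      --  (D) Encode the translated text as UTF-8 octets.
      UTF8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
        Ada.Strings.UTF_Encoding.Wide_Strings.Encode
          (To_Wide_String (Wide));
   begin
      --  Relies on Text_IO passing the octets through unchanged, as
      --  the quoted text says GNAT does; Stream_IO avoids relying on it.
      Ada.Text_IO.Put_Line (UTF8);
   end;
end CP1251_Demo;
```

On a UTF-8 terminal this should show the six-letter Russian word "Привет". Going through Translate keeps the table in one place, so adding the missing 16#80#..16#BF# entries only means extending From and To.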