From: Nicolas Paul Colin de Glocester
Newsgroups: comp.lang.ada,fr.comp.lang.ada
Subject: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Sun, 31 Aug 2025 19:39:56 +0200

Dear Adaists,

Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"Gnat's approach to character encodings is$
$amazingly faulty."                        $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"> System.WCh_Cnv confound JIS character code with Unicode, it makes   $
$> troubles. Wide_Text_IO (and -gnatWs, -gantWe) are useless in fact,   $
$> because there is no what uses JIS character code as it is, conversion$
$> is needed after all.                                                 $
$                                                                       $
$I haven't used that package myself so I don't know how it works, but I $
$won't be surprised if it's buggy. In my experience, Adacore's handling $
$of character encodings is rather unimpressive."                        $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Deadly Head wrote during 2010:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%"This is a pretty big deal to me. For a long time I've been a bit...  %
%frustrated? ... by the fact that the Ada standard specifically gives  %
%us Wide_ and Wide_Wide_Characters and their associated strings, but   %
%actually _using_ them seemed pretty much worthless. I mean, if you    %
%can't actually _talk_ with them to a modern system (UTF-8 or UTF-16   %
%encoding seems to be pretty much the way it goes), what's the point in%
%using them?                                                           %
%                                                                      %
%So I'm pretty happy with using either the WCEM=8 or -gnatW8 methods of%
%setting the encoding to get UTF-8 input and output. What I'm          %
%wondering now is can I get other UTF outputs to work?                 %
%                                                                      %
%I actually have the peculiar case of dealing with UTF-32 encoded      %
%files, which need to be translated to UTF-8 for editing, and back to  %
%UTF-32 for machine-use again. It seems that it would be pretty        %
%straight-forward to just pull the file in with a straight             %
%Wide_Wide_Text_IO.Open/Get_Line system, then output via               %
%Wide_Wide_Text_IO.Put on a file where Form => "WCEM=8". So far,       %
%though, I'm having trouble [. . .]"                                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Ludovic Brenta wrote during 2014:
|-------------------------------------------------------------------------|
|"As for the text that your program must process, that's really up to you.|
|Ada 95 added the Wide_Character and Wide_String to help you use 16-bit   |
|characters (not exactly UTF-16, rather supporting only the first plane   |
|of the Unicode character set); Ada 2005 added Wide_Wide_Character for    |
|32-bit characters (i.e. UTF-32 encoding) The String Encoding package is  |
|there to help you transcode text between 8-bit Latin_1, UTF-8, proper    |
|UTF-16 and UTF-32. The new packages are there to help you but they       |
|don't do anything that wasn't possible in previous versions of Ada       |
|(i.e. you could reimplement them in Ada 95 if you so wished)."           |
|-------------------------------------------------------------------------|

Yannick Duchêne (Hibou57) wrote during 2010:
##############################################################################
#"Extract from the thread “S-expression I/O in Ada”. Subtopic moved in a     #
#separate thread for clarity.                                                #
#                                                                            #
#On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen wrote:                       #
#> Slightly OT, but you (and others) might be interested to know that Ada    #
#> 2012 will include string encoding packages to the various UTF-X           #
#> encodings. These will be (are?) provided very soon by GNAT.               #
#>                                                                           #
#> See AI05-137-2                                                            #
#> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)#
#                                                                            #
#Time for my stupid question of the day :)                                   #
#                                                                            #
#I've noticed this introduction in the last amendment, because Unicode has   #
#always been an issue/matter for me (actually use my own).                   #
#                                                                            #
#I could not avoid two questions: why no UTF-32 ? (this would not be an      #
#implementation nightmare) and why BOM handled for each string while BOM is  #
#to be used at stream/file level ? (see XML or HTML files for example). Or   #
#are these strings supposed to hold the whole content of a file/stream ?     #
#                                                                            #
#Quote:                                                                      #
#http://www.unicode.org/faq/utf_bom.html                                     #
#> A: A byte order mark (BOM) consists of the character code U+FEFF at the   #
#> beginning of a data stream                                                #
#                                                                            #
#This is a FAQ at Unicode.org; but all references (Unicode PDF files, XML    #
#reference, HTTML reference) all says the same.                              #
#                                                                            #
#This matter, because the code point U+FEFF can stands for two different     #
#things: Zero Width No Break Space or encoding Byte Order Mark. The only     #
#way to distinguish both usage, is where-it-appears.                         #
#                                                                            #
#If it appears as the first code point of a stream, this is a BOM            #
#(heuristics may be applied to automatically switch encoding with an         #
#analysis of the first byte of a stream, this is what I do) ; if this        #
#appears any where else in a stream, this is a character code point."        #
##############################################################################

Contrary to “Ada 2012 will include string encoding packages to the
various UTF-X encodings”, the standard Ada package does not support UTF-32!
Even Ada 2022 lacks it!

"Table 23-6. Unicode Encoding Scheme Signatures
Encoding Scheme         Signature
UTF-8                   EF BB BF
UTF-16 Big-endian       FE FF
UTF-16 Little-endian    FF FE
UTF-32 Big-endian       00 00 FE FF
UTF-32 Little-endian    FF FE 00 00"
says
HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19635

iconv --list
reports many kinds: "UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE,
UCS2, UCS4," and "UNICODE, UNICODEBIG, UNICODELITTLE," and "UTF-7-IMAP,
UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE,
UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE".
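
The signatures of Table 23-6 can be checked mechanically. What follows is
only a minimal sketch, written in Python so that it runs with a stock
interpreter; it is not Ada and it uses no Ada library. The names SIGNATURES,
detect_bom and decode_with_bom are hypothetical, introduced for illustration,
and the UTF-8 fallback for unsigned data is an assumption. Longer signatures
are tested first, because the UTF-32 little-endian signature FF FE 00 00
begins with the UTF-16 little-endian signature FF FE.

```python
# Signatures from Table 23-6 of the Unicode core specification, paired with
# the corresponding Python codec names.
# Longest first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (codec name, signature length), or (None, 0) if unsigned."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip the signature, if any, and decode the rest accordingly.

    Unsigned data is decoded with `default` (an assumption of this sketch).
    """
    name, length = detect_bom(data)
    return data[length:].decode(name or default)


if __name__ == "__main__":
    sample = b"\xff\xfe\x00\x00" + "Ada".encode("utf-32-le")
    print(detect_bom(sample))       # codec name and signature length
    print(decode_with_bom(sample))  # the text without its signature
```

A stream beginning FF FE 00 00 could in principle also be UTF-16
little-endian whose first character after the signature is U+0000; the
sketch resolves that ambiguity in favour of UTF-32.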
"package Ada.Strings.UTF_Encoding with Pure is 4/3 -- Declarations common to the string encoding packages type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE); 5/3 subtype UTF_String is String; 6/3 subtype UTF_8_String is String; 7/3 subtype UTF_16_Wide_String is Wide_String; 8/3 Encoding_Error : exception; 9/3 BOM_8 : constant UTF_8_String :=3D Character'Val(16#EF#) & Character'Val(16#BB#) & Character'Val(16#BF#); 10/3 BOM_16BE : constant UTF_String :=3D Character'Val(16#FE#) & Character'Val(16#FF#); 11/3 BOM_16LE : constant UTF_String :=3D Character'Val(16#FF#) & Character'Val(16#FE#); 12/3 BOM_16 : constant UTF_16_Wide_String :=3D (1 =3D> Wide_Character'Val(16#FEFF#));" says HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html without UTF-32. John or Erich Rast wrote during 2014: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^"there are plenty of converters between different Unicode versions^ ^(UTF-8, UTF-16, UTF-32)." ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Contrast with "package Ada.Strings.UTF_Encoding with Pure is 4/3 -- Declarations common to the string encoding packages type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE); [. . .] 
end Ada.Strings.UTF_Encoding; 15/5 package Ada.Strings.UTF_Encoding.Conversions with Pure is 16/3 -- Conversions between various encoding schemes function Convert (Item : UTF_String; Input_Scheme : Encoding_Scheme; Output_Scheme : Encoding_Scheme; Output_BOM : Boolean :=3D False) return UTF_String= ;" says HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html "A full featured character encoding converter will have to provide the=20 following 13 encoding variants of Unicode and UCS: UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE,= =20 UTF-16LE, UTF-32, UTF-32BE, UTF-32LE" says HTTPS://WWW.CL.Cam.ac.UK/~mgk25/unicode.html (The same webpage says: "The term UTF-32 was introduced in Unicode to describe a 4-byte encoding=20 of the extended =E2=80=9C21-bit=E2=80=9D Unicode. UTF-32 is the exact same = thing as UCS-4,=20 except that by definition UTF-32 is never used to represent characters=20 above U-0010FFFF, while UCS-4 can cover all 2[**]31 code positions up to=20 U-7FFFFFFF." Contrast with: "UCS-4 stands for =E2=80=9CUniversal Character Set coded in 4 octets.=E2=80= =9D It is now=20 treated simply as a synonym for UTF-32, and is considered the canonical=20 form for representation of characters in 10646." says HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/appendix-c So much for standardisation!) Randy L. Brukardt wrote during 2017: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>= >> >"In Ada, type Character =3D Latin-1 =3D first 255 code positions, 8-bit = > >representation. Text_IO and type String are for Latin-1 strings. = > > = > >type Wide_Charater =3D BMP (Basic Multilingual Plane) =3D first 65535 code= > >positions =3D UCS-2 =3D 16-bit representation. = > > = > >type Wide_Wide_Character =3D all of Unicode =3D UCS-4 =3D 32-bit represent= ation. > > = > >There is no native support in Ada for UTF-8 or UTF-16 strings. 
There is a = > >conversion package (Ada.Strings.Encoding) [which is nasty because it break= s> >strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO an= d> >Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 = > >(there is no good way to tell between them in the general case). = > > = > >Windows uses a BOM character at the start of UTF-8 files to differentiate = > >(at least in programs like Notepad and the built-in edit control), but tha= t> >is not recommended by Unicode. I think they would prefer a world where = > >Latin-1 had disappeared completely, but that of course is not the real = > >world." = > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>= >> Luke A. Guest wrote during 2021: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !"And this is there the Ada standard gets it wrong, in the encodings! !package re utf-8." ! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Vadim Godunko wrote during 2021: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< <"Ada doesn't have good Unicode support. :( So, you need to find suitable< >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >"Right. The proper thing to do (for Ada 2012) is to use > >Ada.Characters.Wide_Handling (or Wide_Wide_Handling) to do the case> >conversion, after converting the UTF-8 into a Wide_String (or > >Wide_Wide_String)." > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> However, Dmitry A. Kazakov wrote during 2021: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !"Never ever use ! !Wide or Wide_Wide, they are useless."! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Vadim Godunko wrote during 2022: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< <"I think ((Wide_)Wide_)(Character|String) is obsolete for modern <