Dear Adaists,

Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"Gnat's approach to character encodings is$
$amazingly faulty."                        $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"> System.WCh_Cnv confound JIS character code with Unicode, it makes   $
$> troubles. Wide_Text_IO (and -gnatWs, -gantWe) are useless in fact,   $
$> because there is no what uses JIS character code as it is, conversion$
$> is needed after all.                                                 $
$                                                                       $
$I haven't used that package myself so I don't know how it works, but I $
$won't be surprised if it's buggy. In my experience, Adacore's handling $
$of character encodings is rather unimpressive."                        $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Deadly Head wrote during 2010:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%"This is a pretty big deal to me.  For a long time I've been a bit... %
%frustrated? ... by the fact that the Ada standard specifically gives  %
%us Wide_ and Wide_Wide_Characters and their associated strings, but   %
%actually _using_ them seemed pretty much worthless.  I mean, if you   %
%can't actually _talk_ with them to a modern system (UTF-8 or UTF-16   %
%encoding seems to be pretty much the way it goes), what's the point in%
%using them?                                                           %
%                                                                      %
%So I'm pretty happy with using either the WCEM=8 or -gnatW8 methods of%
%setting the encoding to get UTF-8 input and output.  What I'm         %
%wondering now is can I get other UTF outputs to work?                 %
%                                                                      %
%I actually have the peculiar case of dealing with UTF-32 encoded      %
%files, which need to be translated to UTF-8 for editing, and back to  %
%UTF-32 for machine-use again.  It seems that it would be pretty       %
%straight-forward to just pull the file in with a straight             %
%Wide_Wide_Text_IO.Open/Get_Line system, then output via               %
%Wide_Wide_Text_IO.Put on a file where Form => "WCEM=8".  So far,      %
%though, I'm having trouble [. . .]"                                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Ludovic Brenta wrote during 2014:
|-------------------------------------------------------------------------|
|"As for the text that your program must process, that's really up to you.|
|Ada 95 added the Wide_Character and Wide_String to help you use 16-bit   |
|characters (not exactly UTF-16, rather supporting only the first plane   |
|of the Unicode character set); Ada 2005 added Wide_Wide_Character for    |
|32-bit characters (i.e. UTF-32 encoding) The String Encoding package is  |
|there to help you transcode text between 8-bit Latin_1, UTF-8, proper    |
|UTF-16 and UTF-32.  The new packages are there to help you but they      |
|don't do anything that wasn't possible in previous versions of Ada       |
|(i.e. you could reimplement them in Ada 95 if you so wished)."           |
|-------------------------------------------------------------------------|

Yannick Duchêne (Hibou57) wrote during 2010:
##############################################################################
#"Extract from the thread “S-expression I/O in Ada”. Subtopic moved in a     #
#separate thread for clarity.                                                #
#                                                                            #
#On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> wrote:     #
#> Slightly OT, but you (and others) might be interested to know that Ada    #
#> 2012 will include string encoding packages to the various UTF-X           #
#> encodings. These will be (are?) provided very soon by GNAT.               #
#>                                                                           #
#> See AI05-137-2                                                            #
#> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)#
#                                                                            #
#Time for my stupid question of the day :)                                   #
#                                                                            #
#I've noticed this introduction in the last amendment, because Unicode has   #
#always been an issue/matter for me (actually use my own).                   #
#                                                                            #
#I could not avoid two questions: why no UTF-32 ? (this would not be an      #
#implementation nightmare) and why BOM handled for each string while BOM is  #
#to be used at stream/file level ? (see XML or HTML files for example). Or   #
#are these strings supposed to hold the whole content of a file/stream ?     #
#                                                                            #
#Quote:                                                                      #
#http://www.unicode.org/faq/utf_bom.html                                     #
#> A: A byte order mark (BOM) consists of the character code U+FEFF at the   #
#> beginning of a data stream                                                #
#                                                                            #
#This is a FAQ at Unicode.org; but all references (Unicode PDF files, XML    #
#reference, HTTML reference) all says the same.                              #
#                                                                            #
#This matter, because the code point U+FEFF can stands for two different     #
#things: Zero Width No Break Space or encoding Byte Order Mark. The only     #
#way to distinguish both usage, is where-it-appears.                         #
#                                                                            #
#If it appears as the first code point of a stream, this is a BOM            #
#(heuristics may be applied to automatically switch encoding with an         #
#analysis of the first byte of a stream, this is what I do) ; if this        #
#appears any where else in a stream, this is a character code point."        #
##############################################################################

Contrary to “Ada 2012 will include string encoding packages to the 
various UTF-X encodings”, no standard Ada package supports UTF-32! 
Even Ada 2022 lacks it!

"Table 23-6. Unicode Encoding Scheme Signatures
Encoding Scheme         Signature
UTF-8                   EF BB BF
UTF-16 Big-endian       FE FF
UTF-16 Little-endian    FF FE
UTF-32 Big-endian       00 00 FE FF
UTF-32 Little-endian    FF FE 00 00"
says
HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19635
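The signatures of Table 23-6 can be checked mechanically. What follows is 
only a minimal sketch in shell, assuming that  head -c ,  od  and  tr  are 
available;  detect_bom  is a hypothetical helper name of mine, not a 
standard tool. FF FE 00 00 must be tested before FF FE, because the UTF-32 
little-endian signature begins with the UTF-16 little-endian signature.

```shell
#!/bin/sh
# detect_bom FILE: print which Table 23-6 signature (if any) FILE starts with.
# Hypothetical helper for illustration; it inspects only the first 4 octets.
detect_bom() {
    # Hexadecimal dump of at most 4 leading octets, e.g. "efbbbf3c".
    sig=$(head -c 4 < "$1" | od -An -tx1 | tr -d ' \t\n')
    case "$sig" in
        0000feff) echo "UTF-32BE" ;;
        fffe0000) echo "UTF-32LE" ;;  # tested before UTF-16LE on purpose
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        *)        echo "none" ;;
    esac
}

# Demonstration on a throwaway file holding a UTF-8 signature plus "<a/>".
demo=$(mktemp)
printf '\357\273\277<a/>' > "$demo"
detect_bom "$demo"    # prints: UTF-8
rm -f "$demo"
```

A caveat of this sketch: a UTF-16 little-endian file whose first character 
after the signature is U+0000 would be reported as UTF-32LE. The signature 
alone cannot settle that case.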

iconv --list
reports many kinds: "UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, 
UCS2, UCS4," and "UNICODE, UNICODEBIG, UNICODELITTLE," and "UTF-7-IMAP, 
UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, 
UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE".
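Because iconv lists UTF-32BE and UTF-32LE, a round trip of the kind which 
Deadly Head describes can at least be sketched outside Ada. This is a 
sketch under assumptions: an iconv which accepts the names UTF-8 and 
UTF-32BE as listed above; the two function names are hypothetical. An 
explicit byte order is chosen, so no signature is expected in the output.

```shell
#!/bin/sh
# Hypothetical wrappers around iconv; both read stdin and write stdout.
utf8_to_utf32be() { iconv -f UTF-8 -t UTF-32BE; }
utf32be_to_utf8() { iconv -f UTF-32BE -t UTF-8; }

# "A" (16#41#) and U+00FA (16#C3_BA# in UTF-8) become 4 octets each:
# 00 00 00 41 and 00 00 00 fa.
printf 'A\303\272' | utf8_to_utf32be | od -An -tx1

# Converting back yields the original 3 octets: 41 c3 ba.
printf 'A\303\272' | utf8_to_utf32be | utf32be_to_utf8 | od -An -tx1
```

This does not help a program which must stay within standard Ada, but it 
does give reference output against which an Ada conversion could be 
compared.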

"package Ada.Strings.UTF_Encoding
   with Pure is
    -- Declarations common to the string encoding packages
    type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
    subtype UTF_String is String;
    subtype UTF_8_String is String;
    subtype UTF_16_Wide_String is Wide_String;
    Encoding_Error : exception;
    BOM_8    : constant UTF_8_String :=
                 Character'Val(16#EF#) &
                 Character'Val(16#BB#) &
                 Character'Val(16#BF#);
    BOM_16BE : constant UTF_String :=
                 Character'Val(16#FE#) &
                 Character'Val(16#FF#);
    BOM_16LE : constant UTF_String :=
                 Character'Val(16#FF#) &
                 Character'Val(16#FE#);
    BOM_16   : constant UTF_16_Wide_String :=
                (1 => Wide_Character'Val(16#FEFF#));"
says
HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html
without UTF-32.

John or Erich Rast wrote during 2014:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^"there are plenty of converters between different Unicode versions^
^(UTF-8, UTF-16, UTF-32)."                                         ^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contrast with
"package Ada.Strings.UTF_Encoding
   with Pure is
    -- Declarations common to the string encoding packages
    type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
[. . .]
end Ada.Strings.UTF_Encoding;
package Ada.Strings.UTF_Encoding.Conversions
    with Pure is
    -- Conversions between various encoding schemes
    function Convert (Item          : UTF_String;
                      Input_Scheme  : Encoding_Scheme;
                      Output_Scheme : Encoding_Scheme;
                      Output_BOM    : Boolean := False) return UTF_String;"
says
HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html

"A full featured character encoding converter will have to provide the 
following 13 encoding variants of Unicode and UCS:

UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE, 
UTF-16LE, UTF-32, UTF-32BE, UTF-32LE"
says
HTTPS://WWW.CL.Cam.ac.UK/~mgk25/unicode.html

(The same webpage says:
"The term UTF-32 was introduced in Unicode to describe a 4-byte encoding 
of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4, 
except that by definition UTF-32 is never used to represent characters 
above U-0010FFFF, while UCS-4 can cover all 2[**]31 code positions up to 
U-7FFFFFFF."

Contrast with:
"UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now 
treated simply as a synonym for UTF-32, and is considered the canonical 
form for representation of characters in 10646."
says
HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/appendix-c
So much for standardisation!)

Randy L. Brukardt wrote during 2017:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>"In Ada,  type Character = Latin-1 = first 255 code positions, 8-bit       >
>representation. Text_IO and type String are for Latin-1 strings.           >
>                                                                           >
>type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code     >
>positions = UCS-2 = 16-bit representation.                                 >
>                                                                           >
>type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. >
>                                                                           >
>There is no native support in Ada for UTF-8 or UTF-16 strings. There is a  >
>conversion package (Ada.Strings.Encoding) [which is nasty because it breaks>
>strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and>
>Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1  >
>(there is no good way to tell between them in the general case).           >
>                                                                           >
>Windows uses a BOM character at the start of UTF-8 files to differentiate  >
>(at least in programs like Notepad and the built-in edit control), but that>
>is not recommended by Unicode. I think they would prefer a world where     >
>Latin-1 had disappeared completely, but that of course is not the real     >
>world."                                                                    >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Luke A. Guest wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"And this is there the Ada standard gets it wrong, in the encodings!
!package re utf-8."                                                 !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Vadim Godunko wrote during 2021:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<"Ada doesn't have good Unicode support. :( So, you need to find suitable<
<set of "workarounds"."                                                  <
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Randy L. Brukardt wrote during 2013:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>"Right. The proper thing to do (for Ada 2012) is to use            >
>Ada.Characters.Wide_Handling (or Wide_Wide_Handling) to do the case>
>conversion, after converting the UTF-8 into a Wide_String (or      >
>Wide_Wide_String)."                                                >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

However, Dmitry A. Kazakov wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"Never ever use                      !
!Wide or Wide_Wide, they are useless."!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Vadim Godunko wrote during 2022:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<"I think ((Wide_)Wide_)(Character|String) is obsolete for modern    <
<systems and programming languages; more cleaner types and API is a  <
<requirement now. The only case when old character/string types is   <
<really makes value is low resources embedded systems; in other cases<
<their use generates a lot of hidden issues, which is very hard to   <
<detect."                                                            <
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Maxim Reznik wrote during 2021:
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\"You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to\
\process Unicode strings. But this is not very handy. I use the      \
\Matreshka library for Unicode strings."                             \
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

I do not find Matreshka to be handy. Cf. the ALIRE build failure shown 
below my signature.

Dmitry A. Kazakov wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"On 2021-06-21 00:50, Jeffrey R. Carter wrote:                             !
!> On 6/20/21 8:47 PM, Dmitry A. Kazakov wrote:                             !
!>> On 2021-06-20 20:21, Jeffrey R. Carter wrote:                           !
!>>                                                                         !
!>> That ship has sailed. I would say that any use of String as Latin-1 is  !
!>> a mistake now because most of the libraries would use UTF-8 encoding    !
!>> instead of Latin-1.                                                     !
!>                                                                          !
!> I have never subscribed to the illogic that if enough people make the    !
!> same mistake, it ceases to be a mistake.                                 !
!                                                                           !
!The mistake is on the Ada type system design side. People repurposed       !
!Latin-1 strings for UTF-8 strings because there was no other feasible way."!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Cf.
"Why do people do this?!
Honestly, I don't really know. This is one of those mysteries that might
never get solved. Oh, there is one lead: it seems to be generated mostly
(exclusively?) by Windows systems. Really, who would have thought?"
says
HTTPS://WWW.ueber.net/who/mjl/projects/bomstrip

Cf.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~"For a long time, it was believed that Unicode could get by with 16 ~
~bits to represent the characters for all languages of the           ~
~world. Originally, “Unicode” was defined as “16 bit                 ~
~characters”. History showed this was a bad idea, but it was believed~
~to be true for long enough that many systems are stuck with 16 bit  ~
~characters; both Java and Windows, for example, deal in 16 bit      ~
~characters."                                                        ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
says
HTTPS://EntropicThoughts.com/unicode-strings-in-ada-2012
by Christoffer Stjernlöf.

Cf.
"One hundred repetitions three nights a week for four years, thought 
Bernard Marx, who was a specialist on hypnopædia. Sixty-two thousand four 
hundred repetitions make one truth. Idiots!"
says
@book{Sixty-two-thousand-four-hundred-repetitions-make-one-truth-Idiots,
author={Aldous Huxley},
title={{Brave New World}},
publisher={Chatto \& Windus with T. and A. Constable with the University 
Press Edinburgh},
address={London and Edinburgh},
year={1932}
}

Cf. publications by psychologists. E.g. Kimberlee Weaver; Stephen M. 
Garcia; Norbert Schwarz; and Dale T. Miller, "Inferring the Popularity of 
an Opinion From Its Familiarity: A Repetitive Voice Can Sound Like a 
Chorus", "Journal of Personality and Social Psychology", 92(5):821-833, 
2007.

Cf. "majority opinion turns out to be wrong with a fairly high frequency 
in science"
says
James Woodward and David Goodstein, “Conduct, Misconduct and the 
Structure of Science,” September–October, "American Scientist", 1996, 
479–490.

Shark8 wrote during 2013:
////////////////////////////////////////////////////////////////////////
/"UTF-16 is perhaps the worst possible encoding you can have for       /
/Unicode. With UTF-8 you don't need to worry about byte-order          /
/(everything's sequential) and with UTF-32 you don't need to decode the/
/information (each element *IS* a code-point)... but UTF-16 offers     /
/neither of these."                                                    /
////////////////////////////////////////////////////////////////////////

Randy Brukardt wrote during 2023:
******************************************************************************
*"But my opinion is that Ada got strings completely wrong, and the best thing*
*to do with them is to completely nuke them and start over. [. . .]"         *
******************************************************************************

I have been given a dataset. These files are supposedly homogeneous UTF-8 
XML files. Actually
for data_file in *.xml ; do file "$data_file" | sed -e 's/^.*: //' ; done | sort | uniq
reports:
"ASCII text, with CRLF line terminators
Unicode text, UTF-8 text, with CRLF line terminators
XML 1.0 document, Unicode text, UTF-8 (with BOM) text, with CRLF line terminators".
(Whenever  file  does not call an example "XML 1.0 document, Unicode 
[. . .]", that example lacks a line with
<?xml version='1.0' encoding='utf-8'?>
but does still consist of XML parts.)
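One way to make such a dataset homogeneous before any Ada code reads it 
would be to drop the UTF-8 signature where it is present. The following is 
merely a sketch, assuming  head -c  and  tail -c +N  behave as on common 
GNU and BSD systems;  strip_utf8_bom  is a hypothetical name of mine (the 
bomstrip project cited above serves the same purpose).

```shell
#!/bin/sh
# strip_utf8_bom FILE: write FILE to stdout without a leading EF BB BF.
# Hypothetical helper; the input file itself is left unmodified.
strip_utf8_bom() {
    lead=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \t\n')
    if [ "$lead" = "efbbbf" ]; then
        tail -c +4 < "$1"    # start output at the 4th octet
    else
        cat < "$1"
    fi
}

# Demonstration: a signature, then "<a/>", then a CRLF line terminator.
demo=$(mktemp)
printf '\357\273\277<a/>\r\n' > "$demo"
strip_utf8_bom "$demo" | od -An -c
rm -f "$demo"
```

The CRLF line terminators are deliberately left alone here; only the 
signature is removed.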

A valid letter in this language expressed in UTF-8 octets can have:
1 octet (e.g. 16#41#);
2 octets (e.g. 16#C3_BA#);
or
3 octets (e.g. 16#E1_BA_9B#).
I do not believe that I am overlooking a 4-octet example . . . but what 
if I am?
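Whether a 4-octet example hides in the dataset can be tested instead of 
believed. In valid UTF-8 an octet in the range 16#F0# .. 16#F4# occurs 
only as the first octet of a 4-octet sequence, so searching for such 
octets suffices. This is a sketch which assumes that the input is valid 
UTF-8;  has_four_octet_utf8  is a hypothetical helper name.

```shell
#!/bin/sh
# has_four_octet_utf8 FILE: succeed if FILE contains an octet 16#F0#..16#F4#.
# Assumption: FILE is valid UTF-8, so such an octet can only be the lead
# octet of a 4-octet sequence.
has_four_octet_utf8() {
    # One hexadecimal octet per line, then look for f0 .. f4.
    od -An -v -tx1 < "$1" | tr -s ' \t' '\n\n' | grep -q '^f[0-4]$'
}

# Demonstration with the 1-, 2- and 3-octet examples only.
demo=$(mktemp)
printf 'A\303\272\341\272\233' > "$demo"
if has_four_octet_utf8 "$demo"; then
    echo "4-octet sequence found"
else
    echo "none found"    # this branch is taken for the demonstration file
fi
rm -f "$demo"
```

Over the dataset it could be run as
for f in *.xml ; do has_four_octet_utf8 "$f" && echo "$f" ; done
which names every file with at least one 4-octet sequence.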

This is not a constrained computer. It will not run out of memory. It is 
not slow. Deadly Head needs UTF-32. I do not need UTF-32 or UCS-4 for this 
application, but elegance might favour a uniform quantity of octets for 
all letters; and a polyglot user might try to insert some unusual 
punctuation (or whatever) which I do not know, or might copy and paste 
some multilingual table from Unicode.org. I do not want
"a lot of hidden issues, which is very hard to
detect"
as Vadim Godunko said. I do not want a crash, especially with some 
exception which is less informative than a Java exception. Granted, all 
these already existing files are in UTF-8. But what if some future 
application needs general UCS-4?

Sincere regards.



Nicolas Paul Colin de Glocester

cd Matreshka_league__ALIRE_failed_to_build_this

/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this 
$ alr get matreshka_league
ⓘ Running post_fetch actions for matreshka_league=21.0.0...
[. . .]
configure: creating source/league/matreshka-config.ads

matreshka_league=21.0.0 successfully retrieved.
Dependencies were solved as follows:

    + make 4.3.0 (new)


/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this 
$ cd matreshka_league_21.0.0_0c8f4d47

/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47 
$ alr run
ⓘ Building matreshka_league/gnat/matreshka_league.gpr...
Compile
    [Ada]          xml-sax-simple_readers-scanner.adb
[. . .]
league-iris.adb:1476:36: warning: Is_Valid unimplemented [enabled by default]
[. . .]
    [Ada]          matreshka-cldr-collation_rules_parser.adb
matreshka-internals-utf16.ads:100:04: warning: pragma Pack for "Utf16_String" ignored [-gnatwr]
[. . .]
    [Ada]          league-calendars-iso_8601.adb
matreshka-cldr-collation_rules_parser.adb:186:30: warning: assignment to pass-by-copy formal may have no effect [enabled by default]
matreshka-cldr-collation_rules_parser.adb:186:30: warning: "raise" statement may result in abnormal return (RM 6.4.1(17)) [enabled by default]
[. . .]
    [Ada]          matreshka-atomics-generic_test_and_set__gcc__64.adb
matreshka-atomics-counters__gcc.adb:50:14: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:50:14: warning: profile of "Sync_Add_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type mismatch on result [enabled by default]
matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:54:13: warning: profile of "Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
matreshka-atomics-counters__gcc.adb:57:14: warning: intrinsic binding type mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:57:14: warning: profile of "Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by default]
[. . .]
league-locales.ads:46:12: warning: unit "League.Strings" is not referenced [-gnatwu]

    compilation of matreshka-internals-unicode-ucd-properties.adb failed
    compilation of league-strings-cursors-grapheme_clusters.adb failed
    compilation of matreshka-internals-code_point_sets.adb failed
    compilation of league-character_sets.adb failed
    compilation of matreshka-internals-unicode-ucd-norms.ads failed
    compilation of matreshka-internals-unicode-ucd-core.ads failed

gprbuild: *** compilation phase failed
error: Command ["gprbuild", "-s", "-j0", "-p", "-P", "/coldstorage/gloucester/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47/gnat/matreshka_league.gpr"] exited with code 4
error: Build failed

/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47 
$ date
Tue Aug 26 12:03:12 CEST 2025