* Ada 2012 and Unicode package (UTF-nn encodings handling)
@ 2010-08-20 21:38 Yannick Duchêne (Hibou57)
2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
` (4 more replies)
0 siblings, 5 replies; 27+ messages in thread
From: Yannick Duchêne (Hibou57) @ 2010-08-20 21:38 UTC (permalink / raw)
Extract from the thread “S-expression I/O in Ada”. Subtopic moved to a
separate thread for clarity.
On Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> wrote:
> Slightly OT, but you (and others) might be interested to know that Ada
> 2012 will include string encoding packages to the various UTF-X
> encodings. These will be (are?) provided very soon by GNAT.
>
> See AI05-137-2
> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)
Time for my stupid question of the day :)
I noticed this addition in the latest amendment because Unicode has
always been a concern for me (I actually use my own code for it).
I could not avoid two questions: why no UTF-32? (It would not be an
implementation nightmare.) And why is the BOM handled for each string, when
a BOM is meant to be used at the stream/file level (see XML or HTML files,
for example)? Or are these strings supposed to hold the whole content of a
file/stream?
Quote:
http://www.unicode.org/faq/utf_bom.html
> A: A byte order mark (BOM) consists of the character code U+FEFF at the
> beginning of a data stream
This is from a FAQ at Unicode.org, but all references (Unicode PDF files,
the XML reference, the HTML reference) say the same.
This matters because the code point U+FEFF can stand for two different
things: Zero Width No-Break Space or an encoding Byte Order Mark. The only
way to distinguish the two usages is by where it appears.
If it appears as the first code point of a stream, it is a BOM
(heuristics may be applied to automatically switch encoding with an
analysis of the first byte of a stream, this is what I do); if it
appears anywhere else in a stream, it is a character code point.
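To illustrate the position rule, here is a minimal Ada sketch of BOM
sniffing on the first bytes of a stream. The names Sniff_BOM, Signature and
Detect are hypothetical (they are not part of any standard package), and
the byte patterns are the usual Unicode signatures. It is only a sketch: a
real detector also needs a fallback for streams that carry no BOM.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Sniff_BOM is

   --  Hypothetical type, for illustration only.
   type Signature is
     (None, UTF_8, UTF_16BE, UTF_16LE, UTF_32BE, UTF_32LE);

   NUL : constant Character := Character'Val (16#00#);
   BB  : constant Character := Character'Val (16#BB#);
   BF  : constant Character := Character'Val (16#BF#);
   EF  : constant Character := Character'Val (16#EF#);
   FE  : constant Character := Character'Val (16#FE#);
   FF  : constant Character := Character'Val (16#FF#);

   --  Head holds the first octets of the stream, one octet per Character.
   function Detect (Head : String) return Signature is

      function Starts_With (Prefix : String) return Boolean is
        (Head'Length >= Prefix'Length
         and then Head (Head'First .. Head'First + Prefix'Length - 1)
                    = Prefix);

   begin
      --  UTF-32LE is tested before UTF-16LE because FF FE 00 00 also
      --  begins with the UTF-16LE signature FF FE. (The two cannot be
      --  told apart if a UTF-16LE stream starts with BOM + U+0000.)
      if Starts_With (NUL & NUL & FE & FF) then
         return UTF_32BE;
      elsif Starts_With (FF & FE & NUL & NUL) then
         return UTF_32LE;
      elsif Starts_With (EF & BB & BF) then
         return UTF_8;
      elsif Starts_With (FE & FF) then
         return UTF_16BE;
      elsif Starts_With (FF & FE) then
         return UTF_16LE;
      else
         return None;
      end if;
   end Detect;

begin
   Put_Line (Signature'Image (Detect (EF & BB & BF & "<?xml")));
   Put_Line (Signature'Image (Detect ("<?xml")));
end Sniff_BOM;
```

The first call sees the UTF-8 signature and the second sees none. A U+FEFF
found anywhere after these first octets would be left alone and treated as
an ordinary character.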
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
@ 2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
2010-08-21 6:21 ` Dmitry A. Kazakov
` (3 subsequent siblings)
4 siblings, 0 replies; 27+ messages in thread
From: Yannick Duchêne (Hibou57) @ 2010-08-20 21:41 UTC (permalink / raw)
On Fri, 20 Aug 2010 23:38:20 +0200, Yannick Duchêne (Hibou57)
<yannick_duchene@yahoo.fr> wrote:
> (heuristics may be applied to automatically switch encoding with an
> analysis of the first byte of a stream
Mistake: read “analysis of the first byteS” (plural)
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
@ 2010-08-21 6:21 ` Dmitry A. Kazakov
2010-08-21 7:01 ` J-P. Rosen
` (2 subsequent siblings)
4 siblings, 0 replies; 27+ messages in thread
From: Dmitry A. Kazakov @ 2010-08-21 6:21 UTC (permalink / raw)
On Fri, 20 Aug 2010 23:38:20 +0200, Yannick Duchêne (Hibou57) wrote:
> I could not avoid two questions: why no UTF-32 ?
Is there anybody who would ever use it?
> Quote:
> http://www.unicode.org/faq/utf_bom.html
>> A: A byte order mark (BOM) consists of the character code U+FEFF at the
>> beginning of a data stream
>
> This is a FAQ at Unicode.org; but all references (Unicode PDF files, XML
> reference, HTTML reference) all says the same.
How the OS handles the content is entirely its own business. So if this
belongs anywhere, it belongs to stream I/O + directories.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
2010-08-21 6:21 ` Dmitry A. Kazakov
@ 2010-08-21 7:01 ` J-P. Rosen
2010-08-21 8:12 ` Yannick Duchêne (Hibou57)
2010-08-23 22:28 ` Randy Brukardt
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
4 siblings, 1 reply; 27+ messages in thread
From: J-P. Rosen @ 2010-08-21 7:01 UTC (permalink / raw)
On 20/08/2010 23:38, Yannick Duchêne (Hibou57) wrote:
> Time for my stupid question of the day :)
A question is never stupid. Answers sometimes...
> I could not avoid two questions: why no UTF-32 ? (this would not be an
> implementation nightmare)
I still fail to see the benefit of encoding 31-bit values into 32-bit
values...
And even if implementation is not a nightmare, it always has a cost.
Implementers are reluctant to spend money for features that nobody will
use. (Wide_Wide_Character was forced on us by ISO).
> and why BOM handled for each string while BOM
> is to be used at stream/file level ? (see XML or HTML files for
> example).
A package provides functionality. It should not presume how it is
used. Since this package is clearly in the "string handling" class, it
makes sense to handle this with strings.
For files, the usage is to have a BOM on the first line of the file. The
way the functions are defined makes it easy to not process the first
line specially; see the use case in the AI.
--
---------------------------------------------------------
J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-21 7:01 ` J-P. Rosen
@ 2010-08-21 8:12 ` Yannick Duchêne (Hibou57)
2010-08-22 18:51 ` J-P. Rosen
0 siblings, 1 reply; 27+ messages in thread
From: Yannick Duchêne (Hibou57) @ 2010-08-21 8:12 UTC (permalink / raw)
> I still fail to see the benefit of encoding 31 bits values into 32 bits
> values...
UTF-32 is not formally an encoding format; it would better be described
as a matter of byte order. But this byte order is not system-dependent; it
depends on the cross-platform data.
> And even if implementation is not a nightmare, it always has a cost.
> Implementers are reluctant to spend money for features that nobody will
> use. (Wide_Wide_Character was forced on us by ISO).
I suppose ISO forced the introduction of Wide_Wide_Character because
it is part of the Unicode standard and, as you know, conformance requires
full conformance. There is no partial conformance here, because as soon as
something is defined, it may really occur.
Imagine a web crawler: it would have to be designed with this option in
mind. The designers could not say “We do not feel UTF-32 is useful, so our
crawler will not be able to handle such documents”.
I just thought it was a bit of a pity: anyone who wants to rely on the
standard packages' capabilities will only be able to do so partially.
It would be a bit like having doubly linked lists without singly linked
ones (or the opposite). A matter of completeness.
> A package provides functionnalities. It should not presume how it is
> used. Since this package is clearly in the "string handling" class, it
> makes sense to handle this with strings.
Right, this is defined in *String*_Encoding.
> For files, the usage is to have a BOM on the first line of the file. The
> way the functions are defined makes it easy to not process the first
> line specially; see the use case in the AI.
I just had another look at
http://www.ada-auth.org/standards/12aarm/html/AA-A-4-11.html
Only Encode has this capability (via Output_BOM : Boolean). Decode/Convert
have nothing similar and will always skip any 16#FEFF#, which will be
interpreted as a BOM instead of as a character (there is nothing like an
Interpret_BOM : Boolean).
But maybe I am missing something. I will have a deeper look at it and at
the AI which comes with it (I saw UTF-32 was at least “mentioned” during
the discussion).
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-21 8:12 ` Yannick Duchêne (Hibou57)
@ 2010-08-22 18:51 ` J-P. Rosen
2010-08-22 19:48 ` Georg Bauhaus
0 siblings, 1 reply; 27+ messages in thread
From: J-P. Rosen @ 2010-08-22 18:51 UTC (permalink / raw)
On 21/08/2010 10:12, Yannick Duchêne (Hibou57) wrote:
> I just had a look back at
> http://www.ada-auth.org/standards/12aarm/html/AA-A-4-11.html
> Only Encode has this capability (via Output_BOM : Boolean).
> Decode/Convert has nothing similar and will always skip any 16#FEFF#
> which will be interpreted as a BOM instead of as a character (there is
> nothing like an Interpret_BOM : Boolean).
>
> But may be I am missing something. Will have a deeper look at it and at
> the AI which come with it (I saw UTF-32 was at least “pronounced” during
> the talk).
I think you missed the "Encoding" function. The intended usage
(extracted from the !discussion section) is:
1) Read the first line. Call function Encoding on that line with an
appropriate default to use if the line does not start with a
BOM. Initialize the encoding scheme to the value returned by the
function.
2) Decode all lines (including the first one) with the chosen encoding
scheme. Since the BOM is ignored by Decode functions, it is not
necessary to slice the first line specially.
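The two steps above can be sketched as follows. This is only an
illustration under stated assumptions: the two constant strings stand in
for lines read from a file, and the procedure name Decode_Lines is
hypothetical. Encoding and Decode are the functions declared in
Ada.Strings.UTF_Encoding and its child Wide_Wide_Strings.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Decode_Lines is
   package UTF renames Ada.Strings.UTF_Encoding;
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Stand-ins for lines read from a file (assumption for this sketch).
   --  The first line starts with the UTF-8 BOM, followed by "café".
   First_Line  : constant UTF.UTF_String :=
     UTF.BOM_8 & "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);
   Second_Line : constant UTF.UTF_String := "plain";

   --  Step 1: pick the scheme from the first line, with a default that
   --  is used when the line does not start with a BOM.
   Scheme : constant UTF.Encoding_Scheme :=
     UTF.Encoding (First_Line, Default => UTF.UTF_8);

   --  Step 2: decode every line, the first one included, with that
   --  scheme. The leading BOM is not sliced off by hand.
   Text_1 : constant Wide_Wide_String := WWS.Decode (First_Line, Scheme);
   Text_2 : constant Wide_Wide_String := WWS.Decode (Second_Line, Scheme);
begin
   Ada.Text_IO.Put_Line (UTF.Encoding_Scheme'Image (Scheme));
   Ada.Text_IO.Put_Line (Integer'Image (Text_1'Length));
   Ada.Text_IO.Put_Line (Integer'Image (Text_2'Length));
end Decode_Lines;
```

Since Decode ignores the BOM, the decoded first line should contain only
the four characters of "café", so no special case is needed for it.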
--
---------------------------------------------------------
J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-22 18:51 ` J-P. Rosen
@ 2010-08-22 19:48 ` Georg Bauhaus
2010-08-22 20:40 ` J-P. Rosen
0 siblings, 1 reply; 27+ messages in thread
From: Georg Bauhaus @ 2010-08-22 19:48 UTC (permalink / raw)
On 8/22/10 8:51 PM, J-P. Rosen wrote:
> I think you missed the "Encoding" function. The intended usage
> (extracted from the !discussion section) is:
> 1) Read the first line. Call function Encoding on that line with an
> appropriate default to use if the line does not start with a
> BOM. Initialize the encoding scheme to the value returned by the
> function.
Since Ada is an ISO language, is the name BOM for the non-UTF-8
thing used by Microsoft actually ISO? (I.e., has it become part of ISO 10646)?
Georg
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-22 19:48 ` Georg Bauhaus
@ 2010-08-22 20:40 ` J-P. Rosen
2010-08-23 10:32 ` Georg Bauhaus
0 siblings, 1 reply; 27+ messages in thread
From: J-P. Rosen @ 2010-08-22 20:40 UTC (permalink / raw)
On 22/08/2010 21:48, Georg Bauhaus wrote:
> On 8/22/10 8:51 PM, J-P. Rosen wrote:
>
>> I think you missed the "Encoding" function. The intended usage
>> (extracted from the !discussion section) is:
>> 1) Read the first line. Call function Encoding on that line with an
>> appropriate default to use if the line does not start with a
>> BOM. Initialize the encoding scheme to the value returned by the
>> function.
>
> Since Ada is an ISO language, is the name BOM for the non-UTF-8
> thing used by Microsoft actually ISO? (I.e., has it become part of ISO
> 10646)?
>
It's from Unicode. ISO 10646 defines only character encodings
(code-points). Unicode uses the same encodings, and in addition defines
UTF-8 and siblings.
--
---------------------------------------------------------
J-P. Rosen (rosen@adalog.fr)
Visit Adalog's web site at http://www.adalog.fr
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-22 20:40 ` J-P. Rosen
@ 2010-08-23 10:32 ` Georg Bauhaus
0 siblings, 0 replies; 27+ messages in thread
From: Georg Bauhaus @ 2010-08-23 10:32 UTC (permalink / raw)
On 22.08.10 22:40, J-P. Rosen wrote:
> On 22/08/2010 21:48, Georg Bauhaus wrote:
>> On 8/22/10 8:51 PM, J-P. Rosen wrote:
>>
>>> I think you missed the "Encoding" function. The intended usage
>>> (extracted from the !discussion section) is:
>>> 1) Read the first line. Call function Encoding on that line with an
>>> appropriate default to use if the line does not start with a
>>> BOM. Initialize the encoding scheme to the value returned by the
>>> function.
>>
>> Since Ada is an ISO language, is the name BOM for the non-UTF-8
>> thing used by Microsoft actually ISO? (I.e., has it become part of ISO
>> 10646)?
>>
> It's from Unicode. ISO 10646 defines only character encodings
> (code-points).
Uhm, a minor nitpick: ISO/IEC 10646:2003
"* specifies a multiple byte (one to four) byte transformation
UTF-8 for use with ISO 646 (ASCII) byte-oriented environments;
"* specifies a two 16-bit form and associated transformation
UTF-16 for supplementary characters;"
(and LRM A.4.11 seems to mention it, IINM.)
Markus Kuhn explains why in POSIX environments UTF-8 files---that
never have a byte order issue---should *not* have a BOM "signature".
It is, therefore, a good thing that Convert/Encode do not output a
"BOM used as signature" byte sequence by default, since that sequence works
on recent Windows(TM) platforms but creates problems on the ISO standards
compliant platforms.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
"It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
as a signature to mark the beginning of a UTF-8 file. This practice
should definitely not be used on POSIX systems for several reasons:
..."
Indeed, program source files that use "incorrect" Microsoft UTF-8
signatures do create problems with Eclipse when they are used
with both Windows and GNU/Linux editions of Eclipse.
Georg
* Re: Ada 2012 and Unicode package (UTF-nn encodings handling)
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
` (2 preceding siblings ...)
2010-08-21 7:01 ` J-P. Rosen
@ 2010-08-23 22:28 ` Randy Brukardt
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
4 siblings, 0 replies; 27+ messages in thread
From: Randy Brukardt @ 2010-08-23 22:28 UTC (permalink / raw)
"Yannick Duchêne (Hibou57)" <yannick_duchene@yahoo.fr> wrote in message
news:op.vhrad6mjule2fv@garhos...
...
>> See AI05-137-2
>> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)
>
>Time for my stupid question of the day :)
>
>I could not avoid two questions: why no UTF-32 ? (this would not be an
>implementation nightmare) and why BOM handled for each string while BOM is
>to be used at stream/file level ? (see XML or HTML files for example). Or
>are these strings supposed to hold the whole content of a file/stream ?
Did you read the AI? There is a reason that I put links to the AIs into
these messages and links to the AIs from the AARM online. Each AI includes a
!discussion section which typically includes some discussion of the design
decisions. In this case, both of these questions are answered in the AI.
Randy.
* Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
` (3 preceding siblings ...)
2010-08-23 22:28 ` Randy Brukardt
@ 2025-08-31 17:39 ` Nicolas Paul Colin de Glocester
2025-08-31 21:23 ` Kevin Chadwick
2025-09-02 16:01 ` Alex // nytpu
4 siblings, 2 replies; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-08-31 17:39 UTC (permalink / raw)
Dear Adaists,
Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"Gnat's approach to character encodings is$
$amazingly faulty." $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Björn Persson wrote during 2006:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$"> System.WCh_Cnv confound JIS character code with Unicode, it makes $
$> troubles. Wide_Text_IO (and -gnatWs, -gantWe) are useless in fact, $
$> because there is no what uses JIS character code as it is, conversion$
$> is needed after all. $
$ $
$I haven't used that package myself so I don't know how it works, but I $
$won't be surprised if it's buggy. In my experience, Adacore's handling $
$of character encodings is rather unimpressive." $
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Deadly Head wrote during 2010:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%"This is a pretty big deal to me. For a long time I've been a bit... %
%frustrated? ... by the fact that the Ada standard specifically gives %
%us Wide_ and Wide_Wide_Characters and their associated strings, but %
%actually _using_ them seemed pretty much worthless. I mean, if you %
%can't actually _talk_ with them to a modern system (UTF-8 or UTF-16 %
%encoding seems to be pretty much the way it goes), what's the point in%
%using them? %
% %
%So I'm pretty happy with using either the WCEM=8 or -gnatW8 methods of%
%setting the encoding to get UTF-8 input and output. What I'm %
%wondering now is can I get other UTF outputs to work? %
% %
%I actually have the peculiar case of dealing with UTF-32 encoded %
%files, which need to be translated to UTF-8 for editing, and back to %
%UTF-32 for machine-use again. It seems that it would be pretty %
%straight-forward to just pull the file in with a straight %
%Wide_Wide_Text_IO.Open/Get_Line system, then output via %
%Wide_Wide_Text_IO.Put on a file where Form => "WCEM=8". So far, %
%though, I'm having trouble [. . .]" %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Ludovic Brenta wrote during 2014:
|-------------------------------------------------------------------------|
|"As for the text that your program must process, that's really up to you.|
|Ada 95 added the Wide_Character and Wide_String to help you use 16-bit |
|characters (not exactly UTF-16, rather supporting only the first plane |
|of the Unicode character set); Ada 2005 added Wide_Wide_Character for |
|32-bit characters (i.e. UTF-32 encoding) The String Encoding package is |
|there to help you transcode text between 8-bit Latin_1, UTF-8, proper |
|UTF-16 and UTF-32. The new packages are there to help you but they |
|don't do anything that wasn't possible in previous versions of Ada |
|(i.e. you could reimplement them in Ada 95 if you so wished)." |
|-------------------------------------------------------------------------|
Yannick Duchêne (Hibou57) wrote during 2010:
##############################################################################
#"Extract from the thread “S-expression I/O in Ada”. Subtopic moved in a #
#separate thread for clarity. #
# #
#Le Wed, 18 Aug 2010 15:16:50 +0200, J-P. Rosen <rosen@adalog.fr> a écrit: #
#> Slightly OT, but you (and others) might be interested to know that Ada #
#> 2012 will include string encoding packages to the various UTF-X #
#> encodings. These will be (are?) provided very soon by GNAT. #
#> #
#> See AI05-137-2 #
#> (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0137-2.txt?rev=1.2)#
# #
#Time for my stupid question of the day :) #
# #
#I've noticed this introduction in the last amendment, because Unicode has #
#always been an issue/matter for me (actually use my own). #
# #
#I could not avoid two questions: why no UTF-32 ? (this would not be an #
#implementation nightmare) and why BOM handled for each string while BOM is #
#to be used at stream/file level ? (see XML or HTML files for example). Or #
#are these strings supposed to hold the whole content of a file/stream ? #
# #
#Quote: #
#http://www.unicode.org/faq/utf_bom.html #
#> A: A byte order mark (BOM) consists of the character code U+FEFF at the #
#> beginning of a data stream #
# #
#This is a FAQ at Unicode.org; but all references (Unicode PDF files, XML #
#reference, HTTML reference) all says the same. #
# #
#This matter, because the code point U+FEFF can stands for two different #
#things: Zero Width No Break Space or encoding Byte Order Mark. The only #
#way to distinguish both usage, is where-it-appears. #
# #
#If it appears as the first code point of a stream, this is a BOM #
#(heuristics may be applied to automatically switch encoding with an #
#analysis of the first byte of a stream, this is what I do) ; if this #
#appears any where else in a stream, this is a character code point." #
##############################################################################
Contrary to “Ada 2012 will include string encoding packages to the
various UTF-X encodings”, the standard Ada encoding package does not
support UTF-32! Even Ada 2022 lacks it!
"Table 23-6. Unicode Encoding Scheme Signatures
Encoding Scheme          Signature
UTF-8                    EF BB BF
UTF-16 Big-endian        FE FF
UTF-16 Little-endian     FF FE
UTF-32 Big-endian        00 00 FE FF
UTF-32 Little-endian     FF FE 00 00"
says
HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19635
iconv --list
reports many kinds: "UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE,
UCS2, UCS4," and "UNICODE, UNICODEBIG, UNICODELITTLE," and "UTF-7-IMAP,
UTF-7, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE,
UTF7, UTF8, UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE".
"package Ada.Strings.UTF_Encoding
   with Pure is

   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

   subtype UTF_String is String;

   subtype UTF_8_String is String;

   subtype UTF_16_Wide_String is Wide_String;

   Encoding_Error : exception;

   BOM_8 : constant UTF_8_String :=
             Character'Val(16#EF#) &
             Character'Val(16#BB#) &
             Character'Val(16#BF#);

   BOM_16BE : constant UTF_String :=
             Character'Val(16#FE#) &
             Character'Val(16#FF#);

   BOM_16LE : constant UTF_String :=
             Character'Val(16#FF#) &
             Character'Val(16#FE#);

   BOM_16 : constant UTF_16_Wide_String :=
             (1 => Wide_Character'Val(16#FEFF#));"
says
HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html
without UTF-32.
John or Erich Rast wrote during 2014:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^"there are plenty of converters between different Unicode versions^
^(UTF-8, UTF-16, UTF-32)." ^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contrast with
"package Ada.Strings.UTF_Encoding
   with Pure is

   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   [. . .]
end Ada.Strings.UTF_Encoding;

package Ada.Strings.UTF_Encoding.Conversions
   with Pure is

   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;"
says
HTTPS://AdaIC.org/resources/add_content/standards/22rm/html/RM-A-4-11.html
"A full featured character encoding converter will have to provide the
following 13 encoding variants of Unicode and UCS:
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16, UTF-16BE,
UTF-16LE, UTF-32, UTF-32BE, UTF-32LE"
says
HTTPS://WWW.CL.Cam.ac.UK/~mgk25/unicode.html
(The same webpage says:
"The term UTF-32 was introduced in Unicode to describe a 4-byte encoding
of the extended “21-bit” Unicode. UTF-32 is the exact same thing as UCS-4,
except that by definition UTF-32 is never used to represent characters
above U-0010FFFF, while UCS-4 can cover all 2[**]31 code positions up to
U-7FFFFFFF."
Contrast with:
"UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now
treated simply as a synonym for UTF-32, and is considered the canonical
form for representation of characters in 10646."
says
HTTPS://WWW.Unicode.org/versions/Unicode16.0.0/core-spec/appendix-c
So much for standardisation!)
Randy L. Brukardt wrote during 2017:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>"In Ada, type Character = Latin-1 = first 255 code positions, 8-bit >
>representation. Text_IO and type String are for Latin-1 strings. >
> >
>type Wide_Charater = BMP (Basic Multilingual Plane) = first 65535 code >
>positions = UCS-2 = 16-bit representation. >
> >
>type Wide_Wide_Character = all of Unicode = UCS-4 = 32-bit representation. >
> >
>There is no native support in Ada for UTF-8 or UTF-16 strings. There is a >
>conversion package (Ada.Strings.Encoding) [which is nasty because it breaks>
>strong typing] which lets one use UTF-8 and UTF-16 strings with Text_IO and>
>Wide_Text_IO. But you have to know if you are reading in UTF-8 or Latin-1 >
>(there is no good way to tell between them in the general case). >
> >
>Windows uses a BOM character at the start of UTF-8 files to differentiate >
>(at least in programs like Notepad and the built-in edit control), but that>
>is not recommended by Unicode. I think they would prefer a world where >
>Latin-1 had disappeared completely, but that of course is not the real >
>world." >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Luke A. Guest wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"And this is there the Ada standard gets it wrong, in the encodings!
!package re utf-8." !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Vadim Godunko wrote during 2021:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<"Ada doesn't have good Unicode support. :( So, you need to find suitable<
<set of "workarounds"." <
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Randy L. Brukardt wrote during 2013:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>"Right. The proper thing to do (for Ada 2012) is to use >
>Ada.Characters.Wide_Handling (or Wide_Wide_Handling) to do the case>
>conversion, after converting the UTF-8 into a Wide_String (or >
>Wide_Wide_String)." >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
However, Dmitry A. Kazakov wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"Never ever use !
!Wide or Wide_Wide, they are useless."!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Vadim Godunko wrote during 2022:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
<"I think ((Wide_)Wide_)(Character|String) is obsolete for modern <
<systems and programming languages; more cleaner types and API is a <
<requirement now. The only case when old character/string types is <
<really makes value is low resources embedded systems; in other cases<
<their use generates a lot of hidden issues, which is very hard to <
<detect." <
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Maxim Reznik wrote during 2021:
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\"You can use Wide_Wide_String and Unbounded_Wide_Wide_String type to\
\process Unicode strings. But this is not very handy. I use the \
\Matreshka library for Unicode strings." \
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
I do not find Matreshka to be handy. Cf. an ALIRE failure shown below.
Dmitry A. Kazakov wrote during 2021:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!"On 2021-06-21 00:50, Jeffrey R. Carter wrote: !
!> On 6/20/21 8:47 PM, Dmitry A. Kazakov wrote: !
!>> On 2021-06-20 20:21, Jeffrey R. Carter wrote: !
!>> !
!>> That ship has sailed. I would say that any use of String as Latin-1 is !
!>> a mistake now because most of the libraries would use UTF-8 encoding !
!>> instead of Latin-1. !
!> !
!> I have never subscribed to the illogic that if enough people make the !
!> same mistake, it ceases to be a mistake. !
! !
!The mistake is on the Ada type system design side. People repurposed !
!Latin-1 strings for UTF-8 strings because there was no other feasible way."!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Cf.
"Why do people do this?!
Honestly, I don't really know. This is one of those mysteries that might
never get solved. Oh, there is one lead: it seems to be generated mostly
(exclusively?) by Windows systems. Really, who would have thought?"
says
HTTPS://WWW.ueber.net/who/mjl/projects/bomstrip
Cf.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~"For a long time, it was believed that Unicode could get by with 16 ~
~bits to represent the characters for all languages of the ~
~world. Originally, “Unicode” was defined as “16 bit ~
~characters”. History showed this was a bad idea, but it was believed~
~to be true for long enough that many systems are stuck with 16 bit ~
~characters; both Java and Windows, for example, deal in 16 bit ~
~characters." ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
says
HTTPS://EntropicThoughts.com/unicode-strings-in-ada-2012
by Christoffer Stjernlöf.
Cf.
"One hundred repetitions three nights a week for four years, thought
Bernard Marx, who was a specialist on hypnopædia. Sixty-two thousand four
hundred repetitions make one truth. Idiots!"
says
@book{Sixty-two-thousand-four-hundred-repetitions-make-one-truth-Idiots,
author={Aldous Huxley},
title={{Brave New World}},
publisher={Chatto \& Windus with T. and A. Constable with the University
Press Edinburgh},
address={London and Edinburgh},
year={1932}
}
Cf. publications by psychologists. E.g. Kimberlee Weaver; Stephen M.
Garcia; Norbert Schwarz; and Dale T. Miller, "Inferring the Popularity of
an Opinion From Its Familiarity: A Repetitive Voice Can Sound Like a
Chorus", "Journal of Personality and Social Psychology", 92(5):821-833,
2007.
Cf. "majority opinion turns out to be wrong with a fairly high frequency
in science"
says
James Woodward and David Goodstein, “Conduct, Misconduct and the
Structure of Science,” September–October, "American Scientist", 1996,
479–490.
Shark8 wrote during 2013:
////////////////////////////////////////////////////////////////////////
/"UTF-16 is perhaps the worst possible encoding you can have for /
/Unicode. With UTF-8 you don't need to worry about byte-order /
/(everything's sequential) and with UTF-32 you don't need to decode the/
/information (each element *IS* a code-point)... but UTF-16 offers /
/neither of these." /
////////////////////////////////////////////////////////////////////////
Randy Brukardt wrote during 2023:
******************************************************************************
*"But my opinion is that Ada got strings completely wrong, and the best thing*
*to do with them is to completely nuke them and start over. [. . .]" *
******************************************************************************
I have been given a dataset. These files are supposedly homogeneous UTF-8
XML files. Actually
for data_file in *.xml ; do file "$data_file" | sed -e 's/^.*: //' ; done |
sort | uniq
reports:
"ASCII text, with CRLF line terminators
Unicode text, UTF-8 text, with CRLF line terminators
XML 1.0 document, Unicode text, UTF-8 (with BOM) text, with CRLF line
terminators".
(If file does not call an example "XML 1.0 document, Unicode [. . .]"
then such an example lacks a line with
<?xml version='1.0' encoding='utf-8'?>
but does consist of XML parts.)
A valid letter in this language expressed in UTF-8 octets can have:
1 octet (e.g. 16#41#);
2 octets (e.g. 16#C3_BA#);
or
3 octets (e.g. 16#E1_BA_9B#).
I do not believe that I am overlooking a 4-octet example . . . but what
if?
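One way to answer the "what if?" is to scan the octets for lead bytes that
announce a 4-octet sequence. The following is a minimal sketch, assuming
the input is valid UTF-8 held in a String, one octet per Character; the
names UTF8_Lengths, Sequence_Length and Longest are hypothetical helpers,
not standard-library functions.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure UTF8_Lengths is

   --  Length in octets announced by a UTF-8 lead octet, or 0 for a
   --  continuation octet or an octet that is never a valid lead.
   function Sequence_Length (Lead : Character) return Natural is
      Code : constant Natural := Character'Pos (Lead);
   begin
      if Code <= 16#7F# then
         return 1;
      elsif Code in 16#C2# .. 16#DF# then
         return 2;
      elsif Code in 16#E0# .. 16#EF# then
         return 3;
      elsif Code in 16#F0# .. 16#F4# then
         return 4;
      else
         return 0;
      end if;
   end Sequence_Length;

   --  Longest sequence announced by any octet of Item. Continuation
   --  octets contribute 0, so every octet can be examined blindly.
   function Longest (Item : String) return Natural is
      Result : Natural := 0;
   begin
      for C of Item loop
         Result := Natural'Max (Result, Sequence_Length (C));
      end loop;
      return Result;
   end Longest;

   function Octet (Value : Natural) return Character is
     (Character'Val (Value));

   One   : constant String := (1 => Octet (16#41#));
   Two   : constant String := Octet (16#C3#) & Octet (16#BA#);
   Three : constant String :=
     Octet (16#E1#) & Octet (16#BA#) & Octet (16#9B#);
   --  U+1F600, an example outside the Basic Multilingual Plane.
   Four  : constant String :=
     Octet (16#F0#) & Octet (16#9F#) & Octet (16#98#) & Octet (16#80#);

begin
   Put_Line (Natural'Image (Longest (One)));
   Put_Line (Natural'Image (Longest (Two)));
   Put_Line (Natural'Image (Longest (Three)));
   Put_Line (Natural'Image (Longest (One & Four & Two)));
end UTF8_Lengths;
```

Running Longest over the whole contents of each data file would show
whether any 4-octet letter is hiding in the dataset: a result of 4 means
that at least one code point lies beyond U+FFFF.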
This is not a constrained computer. It will not run out of memory. It is
not slow. Deadly Head needs UTF-32. I do not need UTF-32 or UCS-4 for this
application, but elegance might promote a uniform quantity of octets for
all letters; and a polyglot user might try to insert some weird
punctuation or whatever which I do not know or might copy and paste some
multilingual table from Unicode.org. I do not want
"a lot of hidden issues, which is very hard to
detect"
as Vadim Godunko said. I do not want a crash, especially with some
exception which is less informative than a Java exception. Granted, all
these already existing files are in UTF-8. But what if some future
application will need general UCS4?
Sincere regards.
Nicolas Paul Colin de Glocester
cd Matreshka_league__ALIRE_failed_to_build_this
/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this
$ alr get matreshka_league
ⓘ Running post_fetch actions for matreshka_league=21.0.0...
[. . .]
configure: creating source/league/matreshka-config.ads
matreshka_league=21.0.0 successfully retrieved.
Dependencies were solved as follows:
+ make 4.3.0 (new)
/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this
$ cd matreshka_league_21.0.0_0c8f4d47
/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47
$ alr run
ⓘ Building matreshka_league/gnat/matreshka_league.gpr...
Compile
[Ada] xml-sax-simple_readers-scanner.adb
[. . .]
league-iris.adb:1476:36: warning: Is_Valid unimplemented [enabled by
default]
[. . .]
[Ada] matreshka-cldr-collation_rules_parser.adb
matreshka-internals-utf16.ads:100:04: warning: pragma Pack for
"Utf16_String" ignored [-gnatwr]
[. . .]
[Ada] league-calendars-iso_8601.adb
matreshka-cldr-collation_rules_parser.adb:186:30: warning: assignment to
pass-by-copy formal may have no effect [enabled by default]
matreshka-cldr-collation_rules_parser.adb:186:30: warning: "raise"
statement may result in abnormal return (RM 6.4.1(17)) [enabled by
default]
[. . .]
[Ada] matreshka-atomics-generic_test_and_set__gcc__64.adb
matreshka-atomics-counters__gcc.adb:50:14: warning: intrinsic binding type
mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:50:14: warning: profile of
"Sync_Add_And_Fetch_32" doesn't match the builtin it binds [enabled by
default]
matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type
mismatch on result [enabled by default]
matreshka-atomics-counters__gcc.adb:54:13: warning: intrinsic binding type
mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:54:13: warning: profile of
"Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by
default]
matreshka-atomics-counters__gcc.adb:57:14: warning: intrinsic binding type
mismatch on parameter 2 [enabled by default]
matreshka-atomics-counters__gcc.adb:57:14: warning: profile of
"Sync_Sub_And_Fetch_32" doesn't match the builtin it binds [enabled by
default]
[. . .]
league-locales.ads:46:12: warning: unit "League.Strings" is not referenced
[-gnatwu]
compilation of matreshka-internals-unicode-ucd-properties.adb failed
compilation of league-strings-cursors-grapheme_clusters.adb failed
compilation of matreshka-internals-code_point_sets.adb failed
compilation of league-character_sets.adb failed
compilation of matreshka-internals-unicode-ucd-norms.ads failed
compilation of matreshka-internals-unicode-ucd-core.ads failed
gprbuild: *** compilation phase failed
error: Command ["gprbuild", "-s", "-j0", "-p", "-P",
"/coldstorage/gloucester/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47/gnat/matreshka_league.gpr"]
exited with code 4
error: Build failed
/home/gloucester/coldstorage/software_installed/ALIRE/Matreshka_league__ALIRE_failed_to_build_this/matreshka_league_21.0.0_0c8f4d47
$ date
Tue Aug 26 12:03:12 CEST 2025
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
@ 2025-08-31 21:23 ` Kevin Chadwick
2025-08-31 21:27 ` Nicolas Paul Colin de Glocester
2025-09-02 16:01 ` Alex // nytpu
1 sibling, 1 reply; 27+ messages in thread
From: Kevin Chadwick @ 2025-08-31 21:23 UTC (permalink / raw)
Most languages only support working in one encoding: Go uses UTF-8 and
Dart UTF-16. Perhaps Ada was too ambitious, but Wide_Wide worked for me
when I needed it. Finalising support is a potential aim of the next standard.
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-08-31 21:23 ` Kevin Chadwick
@ 2025-08-31 21:27 ` Nicolas Paul Colin de Glocester
0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-08-31 21:27 UTC (permalink / raw)
Dear Mister Chadwick,
Thanks for this contribution.
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
2025-08-31 21:23 ` Kevin Chadwick
@ 2025-09-02 16:01 ` Alex // nytpu
2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
` (3 more replies)
1 sibling, 4 replies; 27+ messages in thread
From: Alex // nytpu @ 2025-09-02 16:01 UTC (permalink / raw)
I've written about this at length before because it's a major pain
point; but I can't find any of my old writing on it so I've rewritten it
here lol. I go into extremely verbose detail on all the recommendations
and the issues at play below, but to summarize:
- You really should use Unicode both in storage/interchange and internally
- Use Wide_Wide_<> internally everywhere in your program
- Use Ada's Streams facility to read/write external text as binary,
transcoding it manually using UTF_Encoding (or custom implemented
routines if you need non-Unicode encodings)
- You can use Text_Streams to get a binary stream even from
stdin/stdout/stderr, although with some annoying caveats regarding
Text_IO adding spurious end-of-file newlines when writing
- Be careful with string functions that inspect the contents of strings
even for Wide_Wide_Strings, because Unicode can have tricky issues
(basically, just only ever look for/split on/etc. hardcoded valid
sequences/characters due to issues with multi-codepoint graphemes)
***
Right off the bat, in modern code either on its own or interfacing with
other modern code, you really should use Unicode, and really really
should use UTF-8. If you use Latin-1 or Windows-1252 or some weird
regional encoding everyone will hate you, and if you restrict inputs to
7-bit ASCII everyone will hate you too lol. And people will get annoyed
if you use UTF-16 or UTF-32 instead of UTF-8 as the interchange/storage
format in a new program.
But first, looking at how you deal with text internally within your
program, you *really* have two options (technically there are more but the
others are not good): storing UTF-8 in Strings (you have to use a
String even for individual characters), or storing UTF-32 in
Wide_Wide_String/Wide_Wide_Character.
When storing UTF-8 in a String (for good practice, use the
Ada.Strings.UTF_Encoding.UTF_8_String subtype just to indicate that it
is UTF-8 and not Latin-1), the main thing is that you have to be very
cautious with (and really should just avoid if possible) any of
the built-in String/Unbounded_String utilities that inspect or
manipulate the contents of text.
With Wide_Wide_<>, you're technically wasting 11 out of every 32 bits of
memory for alignment reasons---or 24 out of 32 bits with text that's
mostly ASCII with only the occasional higher character---but eh, not
that big a deal *on modern systems capable of running a modern hosted
environment*. Note that there is zero chance in hell that UTF-32 will
ever be adopted as an interchange or storage encoding (except in
isolated singular corporate apps *maybe*), so UTF-32 being used should
purely be an internal implementation detail: incoming text in whatever
encoding gets converted to it and outgoing text will always get
converted from it. And you should only convert at the I/O "boundary",
don't have half of your program dealing with native string encoding and
half dealing with Wide_Wide_<> (with the only exception being that if
you don't need to look at the string's contents and are just passing it
through, then you can and should avoid transcoding at all).
I personally use Wide_Wide_<> for everything just because it's more
convenient to have more useful built-in string functions, and it makes
dealing with input/output encoding much easier later (detailed below).
I would never use Wide_<> unless you're exclusively targeting Windows or
something, because UTF-16 is just inconvenient and has none of the
benefits of UTF-8 nor any of the benefits of UTF-32 and most of the
downsides of both. Plus since Ada standardized wide characters so early
there's additional fuckups relating to UCS-2---UTF-16 incompatibilities
like Windows has[1] and you absolutely do not want to deal with that.
I'm unfortunate enough to know most of the nuances of Unicode but I
won't subject you to it, but a lot of the statements in your collection
are a bit oversimplified (UCS-4 has a number of additional differences
from UTF-32 regarding "valid encodings", namely that all valid Unicode
codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive) are
valid in UTF-32), and are missing some additional information: a key
detail is that even with UTF-32 where each Unicode scalar value is held
in one array element rather than being variable-width like UTF-8/UTF-16,
you still can't treat them as arbitrary arrays like 7-bit ASCII because
a grapheme can be made up of multiple Unicode scalar values. Even with
ASCII characters there's the possibility of combining diacritics or such
that would break if you split the string between them.
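The two points in the paragraph above (surrogates are code points but not scalar values, and one grapheme can span several scalar values) can be illustrated with a short sketch. It is written in Python purely for illustration; `is_scalar_value` is a hypothetical helper name, and the ranges are the ones quoted in the paragraph.

```python
import unicodedata

def is_scalar_value(code_point: int) -> bool:
    """True for Unicode scalar values: 0x0..0xD7FF and 0xE000..0x10FFFF."""
    return 0x0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF

# An ASCII "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two scalar values that a reader perceives as one character.
text = "e\u0301"
print(len(text))                       # counts code points, not graphemes
print(unicodedata.combining(text[1]))  # non-zero means a combining mark

# Splitting between the two elements separates the accent from its base,
# even though every array element is a complete scalar value.
head, tail = text[:1], text[1:]
print(repr(head), repr(tail))
```

A fixed-width element type therefore removes the decoding step but not the need to think about where a string may safely be split.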
Also, I just stumbled across Ada.Strings.Text_Buffers which seems to be
new to Ada 2022, makes "string builder" stuff much more convenient
because you can write text using any of Ada's string types and then get
a string in whatever encoding you want (and with the correct
system-specific line endings which is a whole 'nother issue with Ada
strings) out of it instead of needing to fiddle with all that manually,
maybe that'll be useful if you can use Ada 2022.
***
Okay, so I've discussed the internal representation and issues with
that, but now we get into input/output transcoding... this is just a
nightmare in Ada, one almost decent solution but even it has caveats and
bugs, uggh.
In general, the plain Text_IO packages will always transcode the input
file to whatever format you're getting and transcode your given output
to some other format, and it's annoying to configure what encoding is
used at compile time[2] and impossible to change at runtime which makes
the Text_IO packages just useless for non-Latin-1/ASCII IMO. Even if
you get GNAT whipped into shape for your codebase's needs you're
abandoning all portability should a hypothetical second Ada
implementation that you might want to use arise.
The only way to get full control of the input and output encodings is to
use one of Ada's ways of performing binary I/O and then manually convert
strings to binary yourself. I personally prefer using Streams over
Sequential_IO/Direct_IO, using UTF_Encoding (or the new Text_Buffers) to
convert to/from the specific format I want before reading or writing
from the stream.
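The pattern described above (read raw octets, transcode explicitly at the I/O boundary) is language-independent. As a rough sketch only, here it is in Python rather than Ada; it does not show the Ada Streams or UTF_Encoding APIs, and the function names are hypothetical. It assumes the external encoding is UTF-8 and that a single leading BOM, if present, should be dropped.

```python
import codecs

def decode_utf8_octets(raw: bytes) -> str:
    """Decode raw octets as UTF-8, dropping one leading BOM if present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # errors="strict" raises UnicodeDecodeError on ill-formed input
    # instead of silently guessing.
    return raw.decode("utf-8", errors="strict")

def encode_utf8_octets(text: str) -> bytes:
    """Encode internal text as UTF-8 octets, without a BOM."""
    return text.encode("utf-8")

# Octets as they might arrive from a file opened in binary mode.
incoming = codecs.BOM_UTF8 + b"A\xc3\xba"
internal = decode_utf8_octets(incoming)
print([hex(ord(ch)) for ch in internal])
print(encode_utf8_octets(internal).hex())
```

The point of the design is that the rest of the program only ever sees the decoded form, and the choice of external encoding lives in exactly two functions.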
There is one singular bug though: if you use Ada.Text_IO.Text_Streams to
get a byte stream from a Text_IO output file (the only way to
read/write binary data from stdin, stdout, and stderr at all), then
after writing and the file is closed, an extra newline will always be
added. The Ada standard requires that Text_IO always output a newline
if the output didn't end with one, and the stream from text_streams
completely bypasses all of the Text_IO package's bookkeeping, so from
its perspective nothing was written to the file (let alone a newline) so
it has to add a newline.[3] So you either just have to deal with output
files having an empty trailing line or make sure to strip off the final
newline from the text you're outputting.
***
Sorry for it being so long, but that's the horror of working with text
XD, particularly older things like Ada that didn't have the benefit of
modern hindsight for how text encoding would end up and had to bolt on
solutions afterwards that just don't work right. Although at least
Ada is better than the unfixable un-work-aroundable C/C++ nightmare[4]
or Windows or really any software created prior to Unicode 1.1 (1993).
~nytpu
[1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
[2]: The problem is GNAT completely changes how the Text_IO packages
behave with regards to text encoding through opaque methods. The
encodings used by Text_IO are mostly (but not entirely) based off of the
`-gnatW` flag, which is configuring the encoding of THE PROGRAM'S SOURCE
CODE. Absolutely batshit they abused the source file encoding flag as
the only way for the programmer to configure what encoding the program
reads and writes, which is completely orthogonal to the source code.
[3]: When I was more active on IRC, either Lucretia or Shark8 (both of
whom you quoted) would whine about this every chance possible lol. It is
extremely annoying even when you use Text_IO directly rather than
through streams, because it's messing with my damn file even when I
didn't ask it to.
[4]: https://github.com/mpv-player/mpv/commit/1e70e82baa91
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 16:01 ` Alex // nytpu
@ 2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
2025-09-02 18:49 ` Keith Thompson
2025-09-02 17:42 ` Nicolas Paul Colin de Glocester
` (2 subsequent siblings)
3 siblings, 1 reply; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-09-02 17:40 UTC (permalink / raw)
Alex // nytpu wrote during this decade, specifically today:
|---------------------------------------------------------------------------|
|"I can't find any of my old writing on it so I've rewritten it |
|here lol." |
|---------------------------------------------------------------------------|
Dear Alex:
A teammate had once solved a problem but he had forgotten how he
solved it. So he has queried a search engine. So it showed him a
webpage with a perfect solution --- a webpage written by him!
I recommend searching for that old writing about Unicode: perhaps it
has more details than this comp.lang.ada thread, or perhaps a
perspective has been changed in an interesting way. Even if there is
no difference, perhaps it is in a directory with other missing files
which need to be backed up!
|---------------------------------------------------------------------------|
|"If you use Latin-1 or Windows-1252 or some weird |
|regional encoding everyone will hate you, and if you restrict inputs to |
|7-bit ASCII everyone will hate you too lol. And people will get annoyed |
|if you use UTF-16 or UTF-32 instead of UTF-8 as the interchange/storage |
|format in a new program." |
|---------------------------------------------------------------------------|
I quote Usenet articles in a way which does not endear me to
persons. Not everyone reacts in the same way. OC Systems asked me how
do I draw those boxes.
I advocate Ada which also does not endear me to persons.
|---------------------------------------------------------------------------|
|"[. . .] |
| |
|I personally use Wide_Wide_<> for everything just because it's more |
|convenient to have more useful built-in string functions, and it makes |
|dealing with input/output encoding much easier later (detailed below). |
| |
|[. . .] |
| |
|I'm unfortunate enough to know most of the nuances of Unicode but I |
|won't subject you to it, but a lot of the statements in your collection |
|are a bit oversimplified (UCS-4 has a number of additional differences |
|from UTF-32 regarding "valid encodings", [. . .] |
|[. . .]" |
|---------------------------------------------------------------------------|
Thanks for this feedback and more will be as welcome as can be. I
quoted examples of what I found in this newsgroup. This newsgroup used
not to have many statements with explicit references to "UTF-32" or
"UTF32" or "UCS-4" which differ overwhelmingly from what I quoted
during the previous week.
|---------------------------------------------------------------------------|
|"Also, I just stumbled across Ada.Strings.Text_Buffers which seems to be |
|new to Ada 2022, makes "string builder" stuff much more convenient |
|because you can write text using any of Ada's string types and then get |
|a string in whatever encoding you want [. . .] |
|[. . .]" |
|---------------------------------------------------------------------------|
Package Ada.Strings.Text_Buffers does not support UCS-4.
|---------------------------------------------------------------------------|
|"Note that there is zero chance in hell that UTF-32 will ever be adopted as|
|an interchange or storage encoding (except in isolated singular corporate |
|apps *maybe*), so UTF-32 being used should purely be an internal |
|implementation detail: incoming text in whatever encoding gets converted to|
|it and outgoing text will always get converted from it." |
|---------------------------------------------------------------------------|
One can know but what one can too optimistically know can be
false. Character sets or encodings used to be subjects of unfulfilled
expectations.
I can say that for now, UTF-8 is enough for a particular application.
Deadly Head did not have the same luck.
|---------------------------------------------------------------------------|
|"The encodings used by |
|Text_IO are mostly (but not entirely) based off of the `-gnatW` flag, which|
|is configuring the encoding of THE PROGRAM'S SOURCE CODE." |
|---------------------------------------------------------------------------|
GNAT has many switches. It could easily gain more switches.
Sincere regards.
Nicolas Paul Colin de Glocester
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 16:01 ` Alex // nytpu
2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
@ 2025-09-02 17:42 ` Nicolas Paul Colin de Glocester
2025-09-02 19:15 ` Alex // nytpu
2025-09-02 18:08 ` Dmitry A. Kazakov
2025-09-02 22:56 ` Lawrence D’Oliveiro
3 siblings, 1 reply; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-09-02 17:42 UTC (permalink / raw)
The first endnote (i.e.
[1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
) in news:10974d1$jn0e$1@dont-email.me
is not reproduced in
HTTPS://nytpu.com/gemlog/2025-09-02
I do not know if that is intentional. Thanks for saying "It has an amusing
large collection of quotes" on that webpage.
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 16:01 ` Alex // nytpu
2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
2025-09-02 17:42 ` Nicolas Paul Colin de Glocester
@ 2025-09-02 18:08 ` Dmitry A. Kazakov
2025-09-02 19:13 ` Alex // nytpu
2025-09-02 22:56 ` Lawrence D’Oliveiro
3 siblings, 1 reply; 27+ messages in thread
From: Dmitry A. Kazakov @ 2025-09-02 18:08 UTC (permalink / raw)
On 2025-09-02 18:01, Alex // nytpu wrote:
> I've written about this at length before because it's a major pain
> point; but I can't find any of my old writing on it so I've rewritten it
> here lol.
The matter is quite straightforward:
1. Never ever use Wide and Wide_Wide. There is a marginal case of
Windows API where you need Wide_String for UTF-16 encoding. Otherwise,
use cases are absent. No text processing algorithms require code point
access.
2. Use Character as octet. String as UTF-8 encoded.
That is all.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
@ 2025-09-02 18:49 ` Keith Thompson
2025-09-02 19:27 ` Nicolas Paul Colin de Glocester
0 siblings, 1 reply; 27+ messages in thread
From: Keith Thompson @ 2025-09-02 18:49 UTC (permalink / raw)
Nicolas Paul Colin de Glocester <Spamassassin@irrt.De> writes:
> Alex // nytpu wrote during this decade, specifically today:
[...]
> |---------------------------------------------------------------------------|
> |"If you use Latin-1 or Windows-1252 or some weird |
[snip]
> |format in a new program." |
> |---------------------------------------------------------------------------|
>
> I quote Usenet articles in a way which does not endear me to
> persons. Not everyone reacts in the same way. OC Systems asked me how
> do I draw those boxes.
Why do you do that? It seems like a lot of effort to produce an
annoying result.
[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 18:08 ` Dmitry A. Kazakov
@ 2025-09-02 19:13 ` Alex // nytpu
0 siblings, 0 replies; 27+ messages in thread
From: Alex // nytpu @ 2025-09-02 19:13 UTC (permalink / raw)
On 9/2/25 12:08 PM, Dmitry A. Kazakov wrote:
> The matter is quite straightforward:
Objectively false, "text" is never actually straightforward despite what
it seems like on a surface level :P
> 1. Never ever use Wide and Wide_Wide. There is a marginal case of
> Windows API where you need Wide_String for UTF-16 encoding. Otherwise,
> use cases are absent. No text processing algorithms require code point
> access.
Somewhat inclined to agree with Wide_<> but I don't see strong
justification to *never* use Wide_Wide_<>, there's pretty substantial
tradeoffs to both using UTF-32 and UTF-8 (in any programming language
that supports both, but particularly with Ada's string situation) so
unfortunately it ultimately falls on the programmer to understand and
choose.
> 2. Use Character as octet. String as UTF-8 encoded.
Perfectly valid, explicitly mentioned as an option in my post. Maybe
actually would be better for most applications because they wouldn't
need to transcode it, I should've noted that more clearly in my original
response. The only two issues: make sure to avoid the Latin-1 String
routines unless you know what you're doing is sound; and in older Ada
versions I remember reading long debates about whether the String type
might not be able to safely store UTF-8 on many compilers (of the era),
but that issue was clarified by even Ada 95 IIRC.
I just personally prefer Wide_Wide_<> to get its slightly more
Unicode-aware string routines, but it's not the only (or even inherently
the best) option.
~nytpu
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 17:42 ` Nicolas Paul Colin de Glocester
@ 2025-09-02 19:15 ` Alex // nytpu
2025-09-02 19:50 ` Nicolas Paul Colin de Glocester
0 siblings, 1 reply; 27+ messages in thread
From: Alex // nytpu @ 2025-09-02 19:15 UTC (permalink / raw)
On 9/2/25 11:42 AM, Nicolas Paul Colin de Glocester wrote:
> The first endnote (i.e.
> [1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
> ) in news:10974d1$jn0e$1@dont-email.me
> is not reproduced in
> HTTPS://nytpu.com/gemlog/2025-09-02
> I do not know if that is intentional.
I just converted the footnote to an inline link since HTML supports it
while plaintext posts don't.
> Thanks for saying "It has an amusing large collection of quotes" on that webpage.
It is a very thorough collection, I liked it.
(Also I didn't think to ask before posting your original message or my
reply to my website, sorry. I'll take it down if you don't want it
rehosted like that)
~nytpu
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 18:49 ` Keith Thompson
@ 2025-09-02 19:27 ` Nicolas Paul Colin de Glocester
2025-09-02 20:02 ` Keith Thompson
0 siblings, 1 reply; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-09-02 19:27 UTC (permalink / raw)
On Tue, 2 Sep 2025, Keith Thompson wrote:
"> I quote Usenet articles in a way which does not endear me to
> persons. Not everyone reacts in the same way. OC Systems asked me how
> do I draw those boxes.
Why do you do that?"
Such a quoting style is correlated with a possibly misguided perception
that a language does not have a quotation mark at the beginning of each
intermediate line. Indications that this perception is misguided are
English documents which are supposedly from decades before Ada 83 which do
indeed show a "“" (i.e. an English opening quotation mark) at the
beginning of each intermediate line.
However I am not interested enough in English and I do not have enough
time to investigate whether or not that is the real way to quote in
English. If one could show me an authoritative document older than the 20th
century on how to write in English which declares so, then it might nudge
me.
I had not originally believed that drawing rectangles for embedded
quotations is annoying, as others used to draw so before me. However,
unfortunately these rectangles clearly annoy Mister Thompson. Sorry!
" It seems like a lot of effort to produce an
annoying result."
No effort! As I wrote to OC Systems on
Date: Wed, 2 Jul 2008 16:34:41 -0400 (EDT)
long after I wrote an Emacs-Lisp code for these quotations:
"Thank you for asking. At least so far as I have noticed, you are the
first person to have asked me that even though I have been using them
since last year. They are largely created by an Emacs Lisp function
which I wrote (see far below) to save me labor, [. . .]
[. . .]
[. . .] (Emacs Lisp is terrible, but it is commonly available on
email servers and I was using a buggy Common Lisp program at the time
so I thought that drawing the boxes in Emacs Lisp might serve as some
practice for bug fixing in Common Lisp.)
[. . .]"
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 19:15 ` Alex // nytpu
@ 2025-09-02 19:50 ` Nicolas Paul Colin de Glocester
0 siblings, 0 replies; 27+ messages in thread
From: Nicolas Paul Colin de Glocester @ 2025-09-02 19:50 UTC (permalink / raw)
On Tue, 2 Sep 2025, Alex // nytpu wrote:
"It is a very thorough collection,"
Dear Alex,
False. This collection does not quote all the comp.lang.ada articles
referring to "UTF-32" or "UTF32" or "UCS-4" etc. that I read during the
previous week, but there is largely no difference in substance in the ones
that I read during the previous week that I decide to not quote. So as to
have a good Subject: header, I had quite some job deciding which article
to press the reply button on. I wanted to reply to
Subject: Re: Supporting full Unicode
but I did not because I did not actually quote anything from that thread.
On Tue, 2 Sep 2025, Alex // nytpu wrote:
"I liked it."
Thanks and welcome.
On Tue, 2 Sep 2025, Alex // nytpu wrote:
"(Also I didn't think to ask before posting your original message or my
reply to
my website,"
No need to ask so.
On Tue, 2 Sep 2025, Alex // nytpu wrote:
"I'll take it down if you don't want it rehosted like
that)"
I do not oppose rehosting it.
Actually, though
HTTPS://Usenet.Ada-Lang.IO/comp.lang.ada
is excellent and better than all other comp.lang.ada archives, it
unfortunately lacks a few non-spam posts.
Sincere regards.
Nicolas Paul Colin de Glocester
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 19:27 ` Nicolas Paul Colin de Glocester
@ 2025-09-02 20:02 ` Keith Thompson
0 siblings, 0 replies; 27+ messages in thread
From: Keith Thompson @ 2025-09-02 20:02 UTC (permalink / raw)
Nicolas Paul Colin de Glocester <Spamassassin@irrt.De> writes:
> On Tue, 2 Sep 2025, Keith Thompson wrote:
> "> I quote Usenet articles in a way which does not endear me to
>> persons. Not everyone reacts in the same way. OC Systems asked me how
>> do I draw those boxes.
>
> Why do you do that?"
>
> Such a quoting style is correlated with a possibly misguided
> perception that a language does not have a quotation mark at the
> beginning of each intermediate line. Indications that this perception
> is misguided are English documents which are supposedly from decades
> before Ada 83 which do
> indeed show a "“" (i.e. an English opening quotation mark) at the
> beginning of each intermediate line.
[...]
I urge you to adopt the universal Usenet convention of preceding
quoted text from previous articles with "> ". See every other
followup article in this and other newsgroups for examples.
Everyone understands it, virtually everyone but you uses it, and
(almost?) every Usenet client fully supports it. I can think of no
reason not to use it, unless your goal is for your posts to stand
out from others in a rather unpleasant way.
I mentioned before that your scheme seemed like a lot of effort.
In fact the level of effort is irrelevant.
You will of course do what you want, and I don't intend to discuss
it further.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 16:01 ` Alex // nytpu
` (2 preceding siblings ...)
2025-09-02 18:08 ` Dmitry A. Kazakov
@ 2025-09-02 22:56 ` Lawrence D’Oliveiro
2025-09-03 0:20 ` Alex // nytpu
3 siblings, 1 reply; 27+ messages in thread
From: Lawrence D’Oliveiro @ 2025-09-02 22:56 UTC (permalink / raw)
On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
> ... (UCS-4 has a number of additional differences from UTF-32
> regarding "valid encodings", namely that all valid Unicode
> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
> are valid in UTF-32) ...
So what do those codes mean in UCS-4?
> ... and are missing some additional information: a key detail is
> that even with UTF-32 where each Unicode scalar value is held in one
> array element rather than being variable-width like UTF-8/UTF-16,
> you still can't treat them as arbitrary arrays like 7-bit ASCII
> because a grapheme can be made up of multiple Unicode scalar values.
> Even with ASCII characters there's the possibility of combining
> diacritics or such that would break if you split the string between
> them.
This is why you have “normalization”.
<https://www.unicode.org/faq/char_combmark.html>
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-02 22:56 ` Lawrence D’Oliveiro
@ 2025-09-03 0:20 ` Alex // nytpu
2025-09-03 4:10 ` Lawrence D’Oliveiro
0 siblings, 1 reply; 27+ messages in thread
From: Alex // nytpu @ 2025-09-03 0:20 UTC (permalink / raw)
On 9/2/25 4:56 PM, Lawrence D’Oliveiro wrote:
> On Tue, 2 Sep 2025 10:01:34 -0600, Alex // nytpu wrote:
>> ... (UCS-4 has a number of additional differences from UTF-32
>> regarding "valid encodings", namely that all valid Unicode
>> codepoints (0x0--0x10FFFF inclusive) are allowed in UCS-4 but only
>> Unicode scalar values (0x0--0xD7FF and 0xE000--0x10FFFF inclusive)
>> are valid in UTF-32) ...
>
> So what do those codes mean in UCS-4?
Unfortunately, here's where you get more complexity. So there's a
difference between a valid codepoint/scalar value and an assigned scalar
value. The vast majority of valid scalar values are unassigned
(currently 154,998 characters are standardized out of 1,114,112 possible
code points), but everything other than text renderers and normalizers
should handle them like any other character to allow for at least some
level of forwards compatibility when new characters are added.
So in a UCS-4 (or any UCS-<>) implementation, the surrogate codepoints
are just treated like unassigned codepoints (that will never be
assigned, not that they'd know), while they're completely invalid and
should not be represented at all in UTF-32. Implementations should
either error out or replace them with the replacement character U+FFFD
in order to ensure that they're always working with valid UTF-32 (this
is what makes the Windows character set and Ada's Wide_Strings messy:
they were originally standardized before UTF-16, so to keep backwards
compatibility they still support unpaired surrogates, and you have to
sanitize them yourself to avoid making your UTF-8 encoder or the other
software reading your text declare the encoding invalid).
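A minimal sketch of that distinction, in Python purely for illustration.
The function names are made up for this example and don't come from any
library:

```python
# Illustrative only: these helper names are hypothetical.
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

def is_code_point(cp):
    """Anything in 0x0..0x10FFFF, surrogates included (the UCS-4 view)."""
    return 0x0 <= cp <= 0x10FFFF

def is_scalar_value(cp):
    """A code point that is not a surrogate (the UTF-32 view)."""
    return is_code_point(cp) and not (0xD800 <= cp <= 0xDFFF)

def sanitize(code_points):
    """Replace anything that is not a scalar value with U+FFFD."""
    return [cp if is_scalar_value(cp) else REPLACEMENT for cp in code_points]

# 'A', a lone high surrogate, an emoji, and a value past the upper limit.
cleaned = sanitize([0x41, 0xD800, 0x1F600, 0x110000])
```

Here `cleaned` ends up as [0x41, 0xFFFD, 0x1F600, 0xFFFD]: the lone
surrogate and the out-of-range value are replaced, the rest pass through.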
(This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
hellish mess caused by extreme lack of foresight and it's horrible they
saddled everyone, including people not using UTF-16, with this crap.
UTF-16 and its surrogate pairs is also what's responsible for the
maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
encoding the Chinese government came up with can all trivially encode
full 32-bit values)
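The 0x10FFFF limit falls straight out of the surrogate arithmetic: 1024
high surrogates times 1024 low surrogates, on top of the 0x10000 code
points of the BMP. A small Python sketch (the function name is made up
for this example):

```python
HIGH_START = 0xD800  # first of 1024 high (leading) surrogates
LOW_START = 0xDC00   # first of 1024 low (trailing) surrogates

def to_surrogate_pair(cp):
    """Split a supplementary-plane scalar value into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = cp - 0x10000  # fits in 20 bits
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

# Largest value a surrogate pair can reach:
max_code_point = 0x10000 + 1024 * 1024 - 1

pair_for_max = to_surrogate_pair(max_code_point)
pair_for_emoji = to_surrogate_pair(0x1F600)
```

`max_code_point` comes out as 0x10FFFF, and its pair is the last high
surrogate followed by the last low surrogate (0xDBFF, 0xDFFF), so there
is no room left in UTF-16 for anything larger.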
> This is why you have “normalization”.
> <https://www.unicode.org/faq/char_combmark.html>
Still can't just arbitrarily split strings without being careful, there
are characters that are inherently multi-codepoint (e.g. most emoji
among others) and can't be reduced to a single codepoint the way some
combining sequences can. Really, unfortunately, with Unicode you just
shouldn't treat text as an "array" of any fixed-size quantity that you
can index or split freely, because with multi-codepoint graphemes and
combining characters and such, one array element just doesn't reliably
correspond to one user-visible character.
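To make that concrete, here is a short Python sketch (Python strings are
sequences of code points, so len() counts code points, not graphemes).
The two sample strings are just examples picked for illustration:

```python
import unicodedata

# One visible "family" emoji: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
# q followed by COMBINING TILDE; Unicode has no precomposed q-with-tilde.
q_tilde = "q\u0303"

code_points_in_family = len(family)

# Normalization does not collapse either of them into one code point.
nfc_family = unicodedata.normalize("NFC", family)
nfc_q = unicodedata.normalize("NFC", q_tilde)

# Splitting by index separates the base letter from its diacritic.
left, right = q_tilde[:1], q_tilde[1:]
```

The family emoji is five code points before and after NFC, the q with
tilde stays at two, and the naive split leaves a bare "q" on one side
and an orphaned combining tilde on the other.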
Plus conveniently Ada doesn't have routines for normalization, but can't
hold that against it since neither does any other programming language
because the lookup tables required are like 20 MiB even when optimized
for space. (Everyone says to just link to libicu, which also lets you
get out of needing to keep your program's Unicode tables up-to-date when
a new Unicode version releases)
Plus you shouldn't normalize text other than performing actions like
substring matching, equality tests, or sorting---and even if you
normalize when performing those, *when possible* you should store the
unnormalized original for display/output afterwards. Normalization,
especially the compatibility forms (NFKC, NFKD), causes lots of semantic
information loss because many distinct characters are mapped onto one
(e.g. non-breaking and fixed-width spaces are mapped to plain space,
mathematical font variants and superscripts are mapped to the plain
Latin/Greek versions, some lookalike symbols are mapped to the letters
they resemble, such as the ohm sign to Greek capital omega, etc. etc.).
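A few mappings of this kind can be checked with Python's unicodedata
module. This is only a sketch; the sample characters are picked for
illustration:

```python
import unicodedata

samples = {
    "no-break space": "\u00A0",
    "superscript two": "\u00B2",
    "mathematical bold A": "\U0001D400",
    "ohm sign": "\u2126",
}

results = {}
for name, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    results[name] = (nfc, nfkc)
    print(f"{name}: NFC U+{ord(nfc):04X}, NFKC U+{ord(nfkc):04X}")
```

The first three survive NFC untouched but become a plain space, "2" and
"A" under NFKC; the ohm sign is replaced by Greek capital omega even
under NFC. In every case the original character can't be recovered from
the normalized text.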
~nytpu
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-03 0:20 ` Alex // nytpu
@ 2025-09-03 4:10 ` Lawrence D’Oliveiro
2025-09-03 17:25 ` Alex // nytpu
0 siblings, 1 reply; 27+ messages in thread
From: Lawrence D’Oliveiro @ 2025-09-03 4:10 UTC (permalink / raw)
On Tue, 2 Sep 2025 18:20:09 -0600, Alex // nytpu wrote:
> (This whole mess is because UCS-2 and the UCS-2->UTF-16 transition was a
> hellish mess caused by extreme lack of foresight and it's horrible they
> saddled everyone, including people not using UTF-16, with this crap.
I gather the basic problem was that Unicode was originally going to be a
fixed-length 16-bit code, and that was that. And so early adopters
(Windows NT and Java among them), built UCS-2 right into their DNA.
Until Unicode 2.0, I believe it was, where they went “on second thought,
let’s go beyond our original brief and start including all kinds of other
things as well” ... and UCS-2 had to become UTF-16 ...
> UTF-16 and its surrogate pairs is also what's responsible for the
> maximum scalar value being 0x0010_FFFF instead of 0xFFFF_FFFF even
> though UTF-32 and UTF-8 and even goddamn UTF-EBCDIC and the weird
> encoding the Chinese government came up with can all trivially encode
> full 32-bit values)
I wondered about that limit ...
> Plus conveniently Ada doesn't have routines for normalization, but can't
> hold that against it since neither does any other programming language
> because the lookup tables required are like 20 MiB even when optimized
> for space.
I think Python has them
<https://docs.python.org/3/library/unicodedata.html>. But then, on
platforms with decent package management, that data can be shared with
other installed packages that require it as well.
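For reference, a minimal example of what that module offers, using only
calls from the standard library's unicodedata:

```python
import unicodedata

composed = "\u00E9"                                   # e-acute, one code point
decomposed = unicodedata.normalize("NFD", composed)   # e + COMBINING ACUTE
recomposed = unicodedata.normalize("NFC", decomposed)

# Unicode version of the tables bundled with this Python build.
print(unicodedata.unidata_version)
print([f"U+{ord(c):04X}" for c in decomposed])
```

The tables ship with the interpreter, so the Unicode version you get
depends on the Python release rather than on your own program.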
> Plus you shouldn't normalize text other than performing actions like
> substring matching, equality tests, or sorting---and even if you
> normalize when performing those, *when possible* you should store the
> unnormalized original for display/output afterwards.
I thought it was always safe to store decomposed versions of everything.
* Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
2025-09-03 4:10 ` Lawrence D’Oliveiro
@ 2025-09-03 17:25 ` Alex // nytpu
0 siblings, 0 replies; 27+ messages in thread
From: Alex // nytpu @ 2025-09-03 17:25 UTC (permalink / raw)
On 9/2/25 10:10 PM, Lawrence D’Oliveiro wrote:
> I gather the basic problem was that Unicode was originally going to be a
> fixed-length 16-bit code, and that was that. And so early adopters
> (Windows NT and Java among them), built UCS-2 right into their DNA.
>
> Until Unicode 2.0, I believe it was, where they went “on second thought,
> let’s go beyond our original brief and start including all kinds of other
> things as well” ... and UCS-2 had to become UTF-16 ...
Yeah, they started with UCS-2 (as the only encoding) because they
thought that 2^16 characters would be enough, but then a few years later
realized they'd run out extremely quickly, even sticking solely with
actively-used languages and even with the very controversial Han
unification. So they had to hack together the surrogate pairs to allow
multiple planes (and at the same time they were developing UTF-8 for its
desirable compatibility with 7-bit ASCII so they had to stick with
UTF-16's limitations since the other encodings came later).
>> Plus conveniently Ada doesn't have routines for normalization, but can't
>> hold that against it since neither does any other programming language
>> because the lookup tables required are like 20 MiB even when optimized
>> for space.
>
> I think Python has them
> <https://docs.python.org/3/library/unicodedata.html>. But then, on
> platforms with decent package management, that data can be shared with
> other installed packages that require it as well.
Yeah, although it's a language that is expected to have one global
runtime used by everything; anything that's compiled (with or without a
bundled runtime, e.g. Go) doesn't want to impose a mandatory 20 MiB
overhead in every executable for something that you can *usually* get
away with not using (see also the LUTs for Unicode character classes).
>> Plus you shouldn't normalize text other than performing actions like
>> substring matching, equality tests, or sorting---and even if you
>> normalize when performing those, *when possible* you should store the
>> unnormalized original for display/output afterwards.
>
> I thought it was always safe to store decomposed versions of everything.
Well, it depends; storing canonically decomposed (NFD) versions is
acceptable IIRC (maybe not, because I think it still does some limited
substitution for "visually similar" characters, just less extreme than
the compatibility forms NFKD/NFKC) but usually pointless if you don't
need to inspect the contents. Or if you're storing, like, a search
index, then yeah, you should store normalized versions of strings. But
in general just keep the original form of things unless you need to
inspect/compare the contents (and if you don't need to regularly inspect
the contents then just convert it when needed instead of storing the
normalized versions).
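Roughly what I mean, as a Python sketch. The helper name and the sample
data are made up for illustration, and NFC is just one reasonable choice
of comparison form:

```python
import unicodedata

def match_key(text):
    """Key used only for comparison; the stored text is never modified."""
    return unicodedata.normalize("NFC", text)

# Kept exactly as received: decomposed, precomposed, and unaccented.
stored = ["cafe\u0301", "caf\u00E9", "cafe"]
query = "caf\u00E9"

hits = [s for s in stored if match_key(s) == match_key(query)]
```

Both accented spellings match the query, the unaccented one doesn't, and
the strings in `stored` are still byte-for-byte what came in.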
Just my opinion though, there's arguments either way, I just don't like
needlessly messing with the semantics of the input data.
~nytpu
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/
end of thread
Thread overview: 27+ messages
2010-08-20 21:38 Ada 2012 and Unicode package (UTF-nn encodings handling) Yannick Duchêne (Hibou57)
2010-08-20 21:41 ` Yannick Duchêne (Hibou57)
2010-08-21 6:21 ` Dmitry A. Kazakov
2010-08-21 7:01 ` J-P. Rosen
2010-08-21 8:12 ` Yannick Duchêne (Hibou57)
2010-08-22 18:51 ` J-P. Rosen
2010-08-22 19:48 ` Georg Bauhaus
2010-08-22 20:40 ` J-P. Rosen
2010-08-23 10:32 ` Georg Bauhaus
2010-08-23 22:28 ` Randy Brukardt
2025-08-31 17:39 ` Ada 202x; 2022; and " Nicolas Paul Colin de Glocester
2025-08-31 21:23 ` Kevin Chadwick
2025-08-31 21:27 ` Nicolas Paul Colin de Glocester
2025-09-02 16:01 ` Alex // nytpu
2025-09-02 17:40 ` Nicolas Paul Colin de Glocester
2025-09-02 18:49 ` Keith Thompson
2025-09-02 19:27 ` Nicolas Paul Colin de Glocester
2025-09-02 20:02 ` Keith Thompson
2025-09-02 17:42 ` Nicolas Paul Colin de Glocester
2025-09-02 19:15 ` Alex // nytpu
2025-09-02 19:50 ` Nicolas Paul Colin de Glocester
2025-09-02 18:08 ` Dmitry A. Kazakov
2025-09-02 19:13 ` Alex // nytpu
2025-09-02 22:56 ` Lawrence D’Oliveiro
2025-09-03 0:20 ` Alex // nytpu
2025-09-03 4:10 ` Lawrence D’Oliveiro
2025-09-03 17:25 ` Alex // nytpu