From: Alex // nytpu <nytpu@example.invalid>
Subject: Re: Ada 202x; 2022; and 2012 and Unicode package (UTF-nn encodings handling)
Date: Tue, 2 Sep 2025 10:01:34 -0600
Message-ID: <10974d1$jn0e$1@dont-email.me>
In-Reply-To: <a8156fc2-bfbd-8199-b440-0ca9192d6936@insomnia247.nl>

I've written about this at length before because it's a major pain
point, but I can't find any of my old writing on it so I've rewritten it
here lol. I go into extremely verbose detail on all the recommendations
and the issues at play below, but to summarize:
- You really should use Unicode both in storage/interchange and internally
- Use Wide_Wide_<> internally everywhere in your program
- Use Ada's Streams facility to read/write external text as binary,
  transcoding it manually using UTF_Encoding (or custom routines if you
  need non-Unicode encodings)
- You can use Text_Streams to get a binary stream even from
  stdin/stdout/stderr, although with some annoying caveats regarding
  Text_IO adding spurious end-of-file newlines when writing
- Be careful with string functions that inspect the contents of strings,
  even for Wide_Wide_Strings, because Unicode can have tricky issues
  (basically, only ever look for/split on/etc. hardcoded valid
  sequences/characters, due to issues with multi-codepoint graphemes)
***
Right off the bat: in modern code, whether standalone or interfacing
with other modern code, you really should use Unicode, and really,
really should use UTF-8. If you use Latin-1 or Windows-1252 or some
weird regional encoding everyone will hate you, and if you restrict
inputs to 7-bit ASCII everyone will hate you too lol. And people will
get annoyed if you use UTF-16 or UTF-32 instead of UTF-8 as the
interchange/storage format in a new program.
But first, looking at how you deal with text internally in your
program, you *really* have two options (technically there are more, but
the others are not good): storing UTF-8 in Strings (you have to use a
String even for individual characters), or storing UTF-32 in
Wide_Wide_String/Wide_Wide_Character.
When storing UTF-8 in a String (for good practice, use the
Ada.Strings.UTF_Encoding.UTF_8_String subtype just to indicate that it
is UTF-8 and not Latin-1), the main thing is that you have to be very
cautious with (and really should just avoid if possible) any of the
built-in String/Unbounded_String utilities that inspect or manipulate
the contents of text, since they all operate on individual bytes rather
than characters.
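For example, a minimal sketch of the pitfall (the procedure name and
string contents are just mine for illustration):

    --  UTF_8_String is just a subtype of String, so all indexing is
    --  byte-based: "café" is 4 characters but 5 bytes in UTF-8.
    with Ada.Strings.UTF_Encoding;

    procedure UTF_8_Pitfall is
       use Ada.Strings.UTF_Encoding;
       --  16#C3# 16#A9# is the UTF-8 encoding of U+00E9 ("é")
       S : constant UTF_8_String :=
         "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);
    begin
       pragma Assert (S'Length = 5);  --  bytes, not characters
       --  S (5 .. 5) is a lone continuation byte; slicing mid-sequence
       --  silently produces invalid UTF-8.
       null;
    end UTF_8_Pitfall;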
With Wide_Wide_<>, you're technically wasting 11 out of every 32 bits of
memory for alignment reasons---or 24 out of 32 bits with text that's
mostly ASCII with only the occasional higher character---but eh, not
that big a deal *on modern systems capable of running a modern hosted
environment*. Note that there is zero chance in hell that UTF-32 will
ever be adopted as an interchange or storage encoding (except in
isolated singular corporate apps *maybe*), so UTF-32 being used should
purely be an internal implementation detail: incoming text in whatever
encoding gets converted to it and outgoing text will always get
converted from it. And you should only convert at the I/O "boundary":
don't have half of your program dealing with the native string encoding
and half dealing with Wide_Wide_<> (with the only exception being that
if you don't need to look at a string's contents and are just passing it
through, then you can and should avoid transcoding at all).
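To make the "convert only at the boundary" idea concrete, here's a
minimal sketch using the standard UTF_Encoding children (everything
except the library names is mine):

    with Ada.Strings.UTF_Encoding;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    procedure Boundary_Demo is
       package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
       --  Incoming bytes (e.g. freshly read from a stream), as UTF-8
       Input : constant Ada.Strings.UTF_Encoding.UTF_8_String := "hello";
       --  Decode once at the boundary; everything internal is UTF-32
       Internal : constant Wide_Wide_String := WWS.Decode (Input);
       --  ...all the actual work happens on Internal...
       --  Encode once on the way back out
       Output : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
         WWS.Encode (Internal);
    begin
       null;
    end Boundary_Demo;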
I personally use Wide_Wide_<> for everything just because it's more
convenient to have more useful built-in string functions, and it makes
dealing with input/output encoding much easier later (detailed below).
I would never use Wide_<> unless you're exclusively targeting Windows or
something, because UTF-16 is just inconvenient: it has none of the
benefits of UTF-8 nor any of the benefits of UTF-32, and most of the
downsides of both. Plus, since Ada standardized wide characters so
early, there are additional fuckups relating to UCS-2/UTF-16
incompatibilities like Windows has[1], and you absolutely do not want to
deal with that.
I'm unfortunate enough to know most of the nuances of Unicode, and
while I won't subject you to all of them, a lot of the statements in
your collection are a bit oversimplified. For example, UCS-4 has a
number of additional differences from UTF-32 regarding "valid
encodings": all valid Unicode codepoints (0x0--0x10FFFF inclusive) are
allowed in UCS-4, but only Unicode scalar values (0x0--0xD7FF and
0xE000--0x10FFFF inclusive) are valid in UTF-32. The statements are
also missing some additional information. A key detail is that even
with UTF-32, where each Unicode scalar value is held in one array
element rather than being variable-width like UTF-8/UTF-16, you still
can't treat strings as arbitrary arrays like 7-bit ASCII, because a
grapheme can be made up of multiple Unicode scalar values. Even with
ASCII characters there's the possibility of combining diacritics or such
that would break if you split the string between them.
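A minimal sketch of that last point (the example string is just one I
picked):

    with Ada.Wide_Wide_Text_IO;

    procedure Grapheme_Demo is
       --  "é" as two scalar values: LATIN SMALL LETTER E followed by
       --  U+0301 COMBINING ACUTE ACCENT.  One grapheme, two elements.
       E_Acute : constant Wide_Wide_String :=
         "e" & Wide_Wide_Character'Val (16#0301#);
    begin
       --  Prints "Length = 2"; naively splitting between the two
       --  elements separates the base letter from its diacritic.
       Ada.Wide_Wide_Text_IO.Put_Line
         ("Length =" & Integer'Wide_Wide_Image (E_Acute'Length));
    end Grapheme_Demo;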
Also, I just stumbled across Ada.Strings.Text_Buffers, which seems to be
new to Ada 2022. It makes "string builder" stuff much more convenient:
you can write text using any of Ada's string types and then get a
string out in whatever encoding you want (and with the correct
system-specific line endings, which is a whole 'nother issue with Ada
strings), instead of needing to fiddle with all that manually. Maybe
that'll be useful if you can use Ada 2022.
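A minimal sketch of what that might look like, going off the Ada 2022
RM; I haven't used it in anger yet, so treat it as an approximation:

    with Ada.Strings.Text_Buffers.Unbounded;
    with Ada.Text_IO;

    procedure Builder_Demo is
       Buf : Ada.Strings.Text_Buffers.Unbounded.Buffer_Type;
    begin
       Buf.Put ("mix a plain String...");
       Buf.New_Line;  --  the implementation's line terminator
       Buf.Wide_Wide_Put ("...with a Wide_Wide_String");
       --  Pull the accumulated text back out in whichever encoding you
       --  want; Get/Wide_Get/Wide_Wide_Get are also available.
       Ada.Text_IO.Put (Buf.Get_UTF_8);
    end Builder_Demo;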
***
Okay, so I've discussed the internal representation and the issues with
that, but now we get into input/output transcoding... this is just a
nightmare in Ada; there's one almost-decent solution, but even it has
caveats and bugs, uggh.
In general, the Text_IO packages will always transcode the input file
into whatever format you're getting, and transcode your given output
into some other format; it's annoying to configure which encodings are
used at compile time[2] and impossible to change them at runtime, which
makes the Text_IO packages just useless for anything other than
Latin-1/ASCII IMO. Even if you get GNAT whipped into shape for your
codebase's needs, you're abandoning all portability should a
hypothetical second Ada implementation that you might want to use ever
arise.
The only way to get full control of the input and output encodings is to
use one of Ada's facilities for binary I/O and then convert strings to
and from binary yourself. I personally prefer Streams over
Sequential_IO/Direct_IO, using UTF_Encoding (or the new Text_Buffers) to
convert to/from the specific format I want before writing to or after
reading from the stream.
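For example, a minimal sketch of slurping a whole UTF-8 file into a
Wide_Wide_String this way (the function name is mine, error handling is
omitted, and it assumes the usual 8-bit Stream_Element):

    with Ada.Streams.Stream_IO;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    function Read_UTF_8_File (Name : String) return Wide_Wide_String is
       use Ada.Streams.Stream_IO;
       F : File_Type;
    begin
       Open (F, In_File, Name);
       declare
          --  One Character per stream element; Size is in elements
          Raw : String (1 .. Natural (Size (F)));
       begin
          String'Read (Stream (F), Raw);
          Close (F);
          --  Raises UTF_Encoding.Encoding_Error on malformed UTF-8
          return Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode (Raw);
       end;
    end Read_UTF_8_File;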
There is one singular bug though: if you use Ada.Text_IO.Text_Streams to
get a byte stream from a Text_IO output file (the only way to read/write
binary data on stdin, stdout, and stderr at all), then after you finish
writing and the file is closed, an extra newline will always be added.
The Ada standard requires that Text_IO always output a final newline if
the output didn't end with one, and the stream from Text_Streams
completely bypasses all of the Text_IO package's bookkeeping, so from
Text_IO's perspective nothing was written to the file (let alone a
trailing newline) and it adds one.[3] So you either just have to deal
with output files having an empty trailing line, or make sure to strip
the final newline off the text you're outputting.
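For completeness, a minimal sketch of the stdout case (the names are
mine, and the spurious-newline caveat above still applies):

    with Ada.Text_IO.Text_Streams;
    with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

    procedure Raw_Stdout is
       Stdout : constant Ada.Text_IO.Text_Streams.Stream_Access :=
         Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);
       Text : constant Wide_Wide_String :=
         "smile: " & Wide_Wide_Character'Val (16#1F600#);
    begin
       --  Encode to UTF-8 ourselves and write the raw bytes, bypassing
       --  whatever transcoding Text_IO would have done.  Text_IO will
       --  still tack its spurious newline on at exit, though.
       String'Write
         (Stdout,
          Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text));
    end Raw_Stdout;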
***
Sorry for it being so long, but that's the horror of working with text
XD, particularly in older things like Ada that didn't have the benefit
of modern hindsight for how text encoding would end up, and had to bolt
on solutions afterwards that just don't work right. Although at least
Ada is better than the unfixable, un-work-aroundable C/C++ nightmare[4],
or Windows, or really any software created prior to Unicode 1.1 (1993).
~nytpu
[1]: https://wtf-8.codeberg.page/#ill-formed-utf-16
[2]: The problem is that GNAT completely changes how the Text_IO
packages behave with regards to text encoding through opaque methods.
The encodings used by Text_IO are mostly (but not entirely) based off of
the `-gnatW` flag, which configures the encoding of THE PROGRAM'S SOURCE
CODE. It's absolutely batshit that they abused the source file encoding
flag as the only way for the programmer to configure what encoding the
program reads and writes, which is completely orthogonal to the source
code.
[3]: When I was more active on IRC, either Lucretia or Shark8 (both of
whom you quoted) would whine about this at every chance possible lol.
It is extremely annoying even when you use Text_IO directly rather than
through streams, because it's messing with my damn file even when I
didn't ask it to.
[4]: https://github.com/mpv-player/mpv/commit/1e70e82baa91
--
Alex // nytpu
https://nytpu.com/ - gemini://nytpu.com/ - gopher://nytpu.com/