From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: How to read in a (long) UTF-8 file, incrementally?
Date: Tue, 16 Nov 2021 14:23:28 -0600
Message-ID: <sn1401$ubi$1@franka.jacob-sparre.dk>
In-Reply-To: sn08jf$pkq$1@gioia.aioe.org
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:sn08jf$pkq$1@gioia.aioe.org...
> On 2021-11-16 12:55, Marius Amado-Alves wrote:
>> I'm worried. I need the concept of character, for proper text processing.
>
> Simply ignore or reject decomposed characters.
Unicode calls that "requiring Normalization Form C". ("Form D" is all
decomposed characters.) You'll note that what Ada compilers do with text not
in Normalization Form C is implementation-defined; in particular, a compiler
could reject such text.
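
To make the Form C / Form D distinction concrete, here is a minimal sketch in Python rather than Ada (as far as I know the standard Ada library has no normalization routine, so this leans on Python's `unicodedata` module instead). The helper names `is_nfc` and `require_nfc` are made up for illustration, and rejecting non-NFC input is only one possible policy, not something any standard mandates in this exact form.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    # Normalizing NFC text to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", text) == text


def require_nfc(text: str) -> str:
    """Reject text that is not in Normalization Form C (hypothetical policy)."""
    if not is_nfc(text):
        raise ValueError("text is not in Normalization Form C")
    return text


# The same user-perceived character, encoded two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (Form C)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (Form D)
```

With these definitions, `require_nfc(composed)` returns its argument and `require_nfc(decomposed)` raises `ValueError`. A reader that would rather accept such input could call `unicodedata.normalize("NFC", text)` itself instead of rejecting.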
My understanding is that various Internet standards also require
Normalization Form C. For instance, web pages are supposed to always be in
that format. Whether browsers actually enforce that is unknown (they should
enforce a lot of stuff about web pages, but generally just try to muddle
through, which causes all kinds of security issues).
Randy.