From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Why UTF-8 (was Re: Lower bounds of Strings)
Date: Tue, 12 Jan 2021 01:58:46 -0600
Message-ID: <rtjkrn$kf1$1@franka.jacob-sparre.dk>
In-Reply-To: rtcre8$1bug$1@gioia.aioe.org
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:rtcre8$1bug$1@gioia.aioe.org...
> On 2021-01-09 15:52, Jeffrey R. Carter wrote:
>> On 1/9/21 3:31 AM, Randy Brukardt wrote:
>>> The default String should be UTF-8, the others should be reserved for
>>> special cases (interfacing in particular). You don't want the default
>>> string type to restrict the contents, and you don't want it to waste a
>>> lot of space.
>>
>> I don't understand this. I presume there was a time when the extra
>> complexity of UTF-8 was a reasonable price to pay for the larger than
>> 1-byte character range it provided, and there may be systems where it
>> still makes sense, but with most systems these days having GB of memory
>> and TB of storage, the simplicity of using 2 bytes per character seems
>> worth the wasted space. On my 4-yr-old computer I could do everything
>> with 4-byte characters and not have a problem.
>
> Because there is no complexity in UTF-8. String characters are always
> accessed consecutively. So UCS-4 has no advantage over UTF-8.
I wouldn't go so far as to say *no* complexity, but the cases where the
complexity is a major issue are fairly rare. As Dmitry says, most operations
are scans, and UTF-8 was designed so that scans don't need to identify the
starts and ends of characters (you can't get mismatches when doing pattern
matching, for instance).
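
To illustrate the point about scans, here is a minimal sketch in Python (not part of the original post; the names find_utf8 and is_char_start are hypothetical). It searches the raw UTF-8 bytes without decoding them. Because lead bytes and continuation bytes (10xxxxxx) come from disjoint ranges, a byte-level match of a well-formed needle can only begin on a character boundary.

```python
def is_char_start(b: int) -> bool:
    # Continuation bytes have the bit pattern 10xxxxxx; everything else
    # (ASCII or a multi-byte lead byte) starts a character.
    return (b & 0xC0) != 0x80


def find_utf8(haystack: str, needle: str) -> int:
    """Return the character index of needle in haystack, or -1.

    The search itself is a plain byte scan over the UTF-8 encodings.
    """
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")
    pos = h.find(n)  # byte offset of the first match, or -1
    if pos <= 0:
        return pos  # not found, or a match at the very beginning
    # The needle begins with a character-start byte, so the match does too.
    assert is_char_start(h[pos])
    # Convert the byte offset to a character index by counting the
    # character-start bytes that precede the match.
    return sum(1 for b in h[:pos] if is_char_start(b))
```

For example, "\u00e9" (bytes C3 A9) is not found inside "\u00ef" (bytes C3 AF) or "\u00a9" (bytes C2 A9), even though each shares one byte with it, and find_utf8 agrees with str.find on the character index.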
And wasting memory will remain an issue for the foreseeable future. The
number of data structures that fit in the various caches has a substantial
effect on performance. Similarly, the amount of data read from or written to
disk or nonvolatile memory has a big effect on the cost of those
operations. While a programmer can ignore those issues (avoiding premature
optimization is usually a good thing), it's not as clear that a programming
language design can. If the default representation is slow and large, there
can be costs in switching to a better representation (see Ada's support for
UTF-8 for an example!), and also in the perception of the language.
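
As a rough illustration of the space argument (a sketch in Python, not from the original post; encoded_sizes is a hypothetical helper), one can compare how many bytes the same text occupies in each encoding. This measures size only; it says nothing about cache behavior or I/O cost on any particular system.

```python
def encoded_sizes(text: str) -> dict:
    # Byte counts for the same text in three encodings. The "-le"
    # variants are used so that no byte-order mark is included.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


ascii_sizes = encoded_sizes("Hello, world")
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")
```

For the 12-character ASCII string the counts are 12, 24, and 48 bytes. The advantage is not universal: for the three CJK characters in the second call, UTF-8 needs 9 bytes against 6 for UTF-16 and 12 for UTF-32.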
Randy.