From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.5-pre1 (2020-06-20) on
	ip-172-31-74-118.ec2.internal
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=3.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.5-pre1
Path: eternal-september.org!reader02.eternal-september.org!nntp-feed.chiark.greenend.org.uk!ewrotcd!newsfeed.xs3.de!news.jacob-sparre.dk!franka.jacob-sparre.dk!pnx.dk!.POSTED.rrsoftware.com!not-for-mail
From: "Randy Brukardt" <randy@rrsoftware.com>
Newsgroups: comp.lang.ada
Subject: Re: Why UTF-8 (was Re: Lower bounds of Strings)
Date: Tue, 12 Jan 2021 01:58:46 -0600
Organization: JSA Research & Innovation
Message-ID: <rtjkrn$kf1$1@franka.jacob-sparre.dk>
References: <1cc09f04-98f2-4ef3-ac84-9a9ca5aa3fd5n@googlegroups.com> <rt39k8$j8l$1@franka.jacob-sparre.dk> <rt3uv2$1nrd$1@gioia.aioe.org> <rt5jva$7a3$1@franka.jacob-sparre.dk> <rt6ltg$922$1@gioia.aioe.org> <rt80g8$6sf$1@franka.jacob-sparre.dk> <37ada5ff-eee7-4082-ad20-3bd65b5a2778n@googlegroups.com> <rtb4ia$rf1$1@franka.jacob-sparre.dk> <rtcfvn$j33$1@dont-email.me> <rtcre8$1bug$1@gioia.aioe.org>
Injection-Date: Tue, 12 Jan 2021 07:58:47 -0000 (UTC)
Injection-Info: franka.jacob-sparre.dk; posting-host="rrsoftware.com:24.196.82.226";
	logging-data="20961"; mail-complaints-to="news@jacob-sparre.dk"
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-RFC2646: Format=Flowed; Response
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.7246
Xref: reader02.eternal-september.org comp.lang.ada:61105
List-Id: <comp.lang.ada>

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:rtcre8$1bug$1@gioia.aioe.org...
> On 2021-01-09 15:52, Jeffrey R. Carter wrote:
>> On 1/9/21 3:31 AM, Randy Brukardt wrote:
>>> The default String should be UTF-8, the others should be reserved for
>>> special cases (interfacing in particular). You don't want the default
>>> string type to restrict the contents, and you don't want it to waste a
>>> lot of space.
>>
>> I don't understand this. I presume there was a time when the extra 
>> complexity of UTF-8 was a reasonable price to pay for the larger than 
>> 1-byte character range it provided, and there may be systems where it 
>> still makes sense, but with most systems these days having GB of memory 
>> and TB of storage, the simplicity of using 2 bytes per character seems 
>> worth the wasted space. On my 4-yr-old computer I could do everything 
>> with 4-byte characters and not have a problem.
>
> Because there is no complexity in UTF-8. String characters are always 
> accessed consecutively. So UCS-4 has no advantage over UTF-8.

I wouldn't go so far as to say *no* complexity, but the cases where the 
complexity is a major issue are fairly rare. As Dmitry says, most operations 
are scans, and UTF-8 was designed so that scans don't need to identify the 
starts and ends of characters (a byte-wise pattern match can't land in the 
middle of a character, for instance).
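
To make the scanning point concrete, here is a rough sketch in Python (not 
Ada, and not taken from any real library). The sample text and the helper 
names is_char_boundary and find_all are made up for illustration. It searches 
the encoded bytes directly and then checks that every hit starts and ends on 
a character boundary, relying only on the fact that UTF-8 continuation bytes 
always have the bit pattern 10xxxxxx.

```python
# Illustrative sketch only: byte-wise search over valid UTF-8 data.
# The sample strings and helper names are assumptions for this example.

haystack = "naïve café, 日本語 text".encode("utf-8")
needle = "é".encode("utf-8")  # two bytes: 0xC3 0xA9


def is_char_boundary(data, i):
    """True if offset i is the end of data or not a continuation byte."""
    return i == len(data) or (data[i] & 0xC0) != 0x80


def find_all(data, pattern):
    """Return every byte offset where pattern occurs, ignoring characters."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


matches = find_all(haystack, needle)

# Every match begins and ends on a character boundary, even though the
# search never decoded anything.
aligned = all(
    is_char_boundary(haystack, m) and is_char_boundary(haystack, m + len(needle))
    for m in matches
)
```

Note that "ï" shares its first byte (0xC3) with "é", yet the search still 
finds only the real "é": lead bytes and continuation bytes come from disjoint 
ranges, so a valid pattern can't match starting in the middle of a character.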

And wasting memory will remain an issue for the foreseeable future. The 
amount of data that fits in the various caches has a substantial effect on 
performance. Similarly, the amount of data read from or written to disk or 
nonvolatile memory has a big effect on the cost of those operations. While a 
programmer can ignore those issues (avoiding premature optimization is 
usually a good thing), it's not as clear that a programming language design 
can. If the default representation is slow and large, there can be costs in 
switching to a better representation (see Ada's support for UTF-8 for an 
example!), and also in the perception of the language.
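
As a rough illustration of the space argument, the following Python sketch 
counts the bytes needed for a few sample strings in each encoding. The sample 
strings and the encoded_sizes helper are made up for this example, and the 
results depend entirely on the text: mostly-ASCII data favors UTF-8 heavily, 
while CJK text is actually smaller in UTF-16.

```python
# Illustrative sketch only: storage cost of the same text in three encodings.
# The "-le" codecs are used so that no byte-order mark is counted.

samples = {
    "ascii": "Hello, world",
    "latin": "naïve café",
    "cjk": "日本語",
}


def encoded_sizes(text):
    """Return the encoded length in bytes for UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


sizes = {name: encoded_sizes(text) for name, text in samples.items()}

for name, result in sizes.items():
    print(name, result)
```

For the ASCII sample, UTF-8 needs 12 bytes against 24 for UTF-16 and 48 for 
UTF-32; for the three CJK characters it needs 9 bytes against 6 and 12.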

                                         Randy.