comp.lang.ada
From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Why UTF-8 (was Re: Lower bounds of Strings)
Date: Tue, 12 Jan 2021 01:58:46 -0600
Message-ID: <rtjkrn$kf1$1@franka.jacob-sparre.dk>
In-Reply-To: rtcre8$1bug$1@gioia.aioe.org

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:rtcre8$1bug$1@gioia.aioe.org...
> On 2021-01-09 15:52, Jeffrey R. Carter wrote:
>> On 1/9/21 3:31 AM, Randy Brukardt wrote:
>>> The default String should be UTF-8, the others should be reserved for
>>> special cases (interfacing in particular). You don't want the default
>>> string type to restrict the contents, and you don't want it to waste a
>>> lot of space.
>>
>> I don't understand this. I presume there was a time when the extra 
>> complexity of UTF-8 was a reasonable price to pay for the larger than 
>> 1-byte character range it provided, and there may be systems where it 
>> still makes sense, but with most systems these days having GB of memory 
>> and TB of storage, the simplicity of using 2 bytes per character seems 
>> worth the wasted space. On my 4-yr-old computer I could do everything 
>> with 4-byte characters and not have a problem.
>
> Because there is no complexity in UTF-8. String characters are always 
> accessed consequently. So UCS-4 has no advantage over UTF-8.

I wouldn't go so far as to say *no* complexity, but the cases where the 
complexity is a major issue are fairly rare. As Dmitry says, most operations 
are scans, and UTF-8 was designed so that scans don't need to identify the 
starts and ends of characters (you can't get mismatches when doing pattern 
matching, for instance).
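
To make the point about scans concrete, here is a minimal sketch (in Python, 
not from the original post; the sample text and the helper names find_all 
and is_char_boundary are made up for illustration). It searches the raw 
UTF-8 bytes without decoding them and checks that every match lands on a 
character boundary, which holds as long as the pattern itself is 
well-formed UTF-8:

```python
def find_all(haystack, needle):
    """Return every offset at which needle occurs in haystack (str or bytes)."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


def is_char_boundary(data, offset):
    """True unless offset points at a UTF-8 continuation byte (10xxxxxx)."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


text = "naïve café, 日本語 and Ünïcödé"  # hypothetical sample text
needle = "é"

data = text.encode("utf-8")
pattern = needle.encode("utf-8")

# Search the raw bytes; nothing is decoded during the scan.
byte_hits = find_all(data, pattern)

# Check that every match starts and ends on a character boundary.
aligned = all(
    is_char_boundary(data, h) and is_char_boundary(data, h + len(pattern))
    for h in byte_hits
)

# Map byte offsets back to character offsets and compare them with a
# character-level search of the same text.
char_hits_from_bytes = [len(data[:h].decode("utf-8")) for h in byte_hits]
char_hits = find_all(text, needle)

print(aligned, char_hits_from_bytes == char_hits)
```

The byte-level search finds the same occurrences as the character-level 
one because a lead byte never looks like a continuation byte, so a 
well-formed pattern cannot match starting in the middle of a character.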

And wasting memory will remain an issue for the foreseeable future. The 
number of data structures that fit in the various caches has a substantial 
effect on performance. Similarly, the amount of data read from or written to 
disk or nonvolatile memory has a big effect on the cost of those 
operations. While a programmer can ignore those issues (avoiding premature 
optimization is usually a good thing), it's not as clear that a programming 
language design can. If the default representation is slow and large, there 
can be costs in switching to a better representation (see Ada's support for 
UTF-8 for an example!), and also in the perception of the language.
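
As a rough illustration of the space argument (again a Python sketch, not 
from the original post; the sample strings and the name encoded_sizes are 
assumptions), the following compares the encoded size of the same text in 
UTF-8, UTF-16 and UTF-32, using the little-endian codecs so that no byte 
order mark is counted:

```python
def encoded_sizes(text):
    """Return the size in bytes of text under three Unicode encodings."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


# Hypothetical sample strings: mostly-ASCII text and CJK text.
samples = {
    "ascii": "Hello, Ada",
    "cjk": "日本語",
}

for label, sample in samples.items():
    print(label, encoded_sizes(sample))
```

For mostly-ASCII text, UTF-8 takes a quarter of the space of UTF-32 and 
half that of UTF-16. It is not always the smallest choice: CJK characters 
in the Basic Multilingual Plane take three bytes in UTF-8 but two in 
UTF-16. No character needs more than four bytes in UTF-8, so it never 
exceeds the size of UTF-32.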

                                         Randy.


