From: "Randy Brukardt"
Newsgroups: comp.lang.ada
Subject: Re: Why UTF-8 (was Re: Lower bounds of Strings)
Date: Tue, 12 Jan 2021 01:58:46 -0600
Organization: JSA Research & Innovation

"Dmitry A. Kazakov" wrote in message news:rtcre8$1bug$1@gioia.aioe.org...
> On 2021-01-09 15:52, Jeffrey R. Carter wrote:
>> On 1/9/21 3:31 AM, Randy Brukardt wrote:
>>> The default String should be UTF-8, the others should be reserved for
>>> special cases (interfacing in particular). You don't want the default
>>> string type to restrict the contents, and you don't want it to waste a
>>> lot of space.
>>
>> I don't understand this.
>> I presume there was a time when the extra
>> complexity of UTF-8 was a reasonable price to pay for the larger than
>> 1-byte character range it provided, and there may be systems where it
>> still makes sense, but with most systems these days having GB of memory
>> and TB of storage, the simplicity of using 2 bytes per character seems
>> worth the wasted space. On my 4-yr-old computer I could do everything
>> with 4-byte characters and not have a problem.
>
> Because there is no complexity in UTF-8. String characters are always
> accessed consequently. So UCS-4 has no advantage over UTF-8.

I wouldn't go so far as to say *no* complexity, but the cases where the
complexity is a major issue are fairly rare. As Dmitry says, most operations
are scans, and UTF-8 was designed so that scans don't need to identify the
starts and ends of characters (you can't get mismatches when doing pattern
matching, for instance).

And wasting memory will remain an issue for the foreseeable future. The
number of data structures that fit in the various caches has a substantial
effect on performance. Similarly, the amount of data read from or written to
disk/nonvolatile memory has a big effect on the cost of those operations.

While a programmer can ignore those issues (avoiding premature optimization
is usually a good thing), it's not as clear that a programming language
design can. If the default representation is slow and large, there can be
costs in switching to a better representation (see Ada's support for UTF-8
for an example!), and also in the perception of the language.

                                 Randy.
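
As an illustration of the two points above (scans that never need to find
character boundaries, and the space cost of wider encodings), here is a
minimal Python sketch. It is not from the thread: the function names
find_all, is_char_start and encoded_sizes are made up for this example, and
it assumes both strings are valid text that Python can encode as UTF-8.

```python
def is_char_start(data, i):
    """True if byte offset i begins a UTF-8 character (or is end of data).

    Continuation bytes always have the bit pattern 10xxxxxx.
    """
    return i == len(data) or (data[i] & 0xC0) != 0x80


def find_all(haystack, needle):
    """Search byte-wise on the UTF-8 encodings, with no decoding step."""
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")
    if not n:
        return []
    hits = []
    pos = h.find(n)
    while pos != -1:
        # A valid UTF-8 needle can only match on character boundaries,
        # so a plain byte scan never reports a match inside a character.
        assert is_char_start(h, pos) and is_char_start(h, pos + len(n))
        hits.append(pos)
        pos = h.find(n, pos + 1)
    return hits


def encoded_sizes(text):
    """Bytes needed to store text in each encoding (no byte-order mark)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(find_all("naïve café, naïve", "ï"))  # [2, 16]
print(encoded_sizes("hello"))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
```

The "é" in "café" shares its first byte with "ï", yet the byte scan does not
report it, because the second byte differs and no character's encoding can
begin in the middle of another's. The size figures are for ASCII-only text,
the best case for UTF-8; text in other scripts narrows or reverses the gap
with the 2-byte encoding.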