From: "Dmitry A. Kazakov"
Newsgroups: comp.lang.ada
Subject: Re: Why UTF-8 (was Re: Lower bounds of Strings)
Date: Sat, 9 Jan 2021 19:08:11 +0100
Organization: Aioe.org NNTP Server
References: <1cc09f04-98f2-4ef3-ac84-9a9ca5aa3fd5n@googlegroups.com>
 <37ada5ff-eee7-4082-ad20-3bd65b5a2778n@googlegroups.com>

On 2021-01-09 15:52, Jeffrey R. Carter wrote:
> On 1/9/21 3:31 AM, Randy Brukardt wrote:
>> The default String should be UTF-8, the others should be reserved for
>> special cases (interfacing in particular). You don't want the default
>> string type to restrict the contents, and you don't want it to waste
>> a lot of space.
>
> I don't understand this. I presume there was a time when the extra
> complexity of UTF-8 was a reasonable price to pay for the larger than
> 1-byte character range it provided, and there may be systems where it
> still makes sense, but with most systems these days having GB of memory
> and TB of storage, the simplicity of using 2 bytes per character seems
> worth the wasted space. On my 4-yr-old computer I could do everything
> with 4-byte characters and not have a problem.

Because there is no complexity in UTF-8.
String characters are always accessed sequentially, so the fixed-width
indexing of UCS-4 gives it no advantage over UTF-8.

As for interfaces, any string always has two: a presentation array
interface (encoding) and a character array interface (view). The language
had better support proper abstractions; then whatever encoding is used
will be no problem.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
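
The sequential-access point and the two interfaces can be sketched roughly as follows. This is only an illustration in Python rather than Ada; the names iter_code_points and Utf8Text are hypothetical, and the decoder assumes reasonably well-formed input (it does not reject overlong forms or surrogates).

```python
from typing import Iterator


def iter_code_points(data: bytes) -> Iterator[int]:
    """Yield code points from UTF-8 bytes, reading strictly left to right."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte alone tells how many bytes the character occupies.
        if lead < 0x80:               # 0xxxxxxx: 1 byte
            size, cp = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: 2 bytes
            size, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3 bytes
            size, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: 4 bytes
            size, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + size > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + size]:
            if b >> 6 != 0b10:        # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += size


class Utf8Text:
    """Hypothetical string type exposing both interfaces of one value."""

    def __init__(self, encoded: bytes) -> None:
        # Encoding interface: the raw array of UTF-8 octets.
        self.encoded = encoded

    def __iter__(self) -> Iterator[str]:
        # Character view: the same data seen as a sequence of characters.
        return (chr(cp) for cp in iter_code_points(self.encoded))


if __name__ == "__main__":
    text = Utf8Text("Ada Ада €".encode("utf-8"))
    for ch in text:                   # a single forward pass, no indexing
        print(ch, end="")
    print()
```

Client code that only iterates over Utf8Text never sees the octets, so the storage encoding could be changed without touching it.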