From: "Randy Brukardt" <randy@rrsoftware.com>
Subject: Re: Ada and Unicode
Date: Fri, 8 Apr 2022 23:05:38 -0500
Message-ID: <t2r0mk$q4d$1@dont-email.me>
In-Reply-To: t2q3cb$bbt$1@gioia.aioe.org
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:t2q3cb$bbt$1@gioia.aioe.org...
> On 2022-04-08 21:19, Simon Wright wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> writes:
>>
>>> On 2022-04-08 10:56, Simon Wright wrote:
>>>> "Randy Brukardt" <randy@rrsoftware.com> writes:
>>>>
>>>>> If you had an Ada-like language that used a universal UTF-8 string
>>>>> internally, you then would have a lot of old and mostly useless
>>>>> operations supported for array types (since things like slices are
>>>>> mainly useful for string operations).
>>>>
>>>> Just off the top of my head, wouldn't it be better to use
>>>> UTF32-encoded Wide_Wide_Character internally?
>>>
>>> Yep, that is exactly the problem, a confusion between interface
>>> and implementation.
>>
>> Don't understand. My point was that *when you are implementing this* it
>> might be easier to deal with 32-bit characters/code points/whatever the
>> proper jargon is than with UTF8.
>
> I think it would be more difficult, because you will have to convert from
> and to UTF-8 under the hood or explicitly. UTF-8 is de-facto interface
> standard and I/O standard. That would be 60-70% of all cases you need a
> string. Most string operations like search, comparison, slicing are
> isomorphic between code points and octets. So you would win nothing from
> keeping strings internally as arrays of code points.

I basically agree with Dmitry here. The internal representation is an
implementation detail, but it seems likely that you would want to store
UTF-8 strings directly; they're almost always going to be half the size of
the UTF-32 form (even for languages using their own characters like Greek),
and for most of us, they'll be just a bit more than a quarter the size. The
number of bytes you copy around matters; the number of operations where code
points are needed is fairly small.
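
A minimal sketch of that size comparison, assuming the standard Ada 2012
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is available. The names
Size_Demo and Report are made up for illustration, and the sample strings
are arbitrary:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Size_Demo is
   use Ada.Strings.UTF_Encoding;

   --  Report the storage needed for Item in UTF-8 and in UTF-32
   --  (UTF-32 always takes four octets per code point).
   procedure Report (Label : String; Item : Wide_Wide_String) is
      Encoded : constant UTF_8_String := Wide_Wide_Strings.Encode (Item);
   begin
      Ada.Text_IO.Put_Line
        (Label & ": UTF-8" & Natural'Image (Encoded'Length)
         & " octets, UTF-32" & Natural'Image (4 * Item'Length) & " octets");
   end Report;

   --  Greek alpha, beta, gamma, spelled as code points so that the
   --  encoding of this source file does not matter.
   Greek : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#3B1#),
      Wide_Wide_Character'Val (16#3B2#),
      Wide_Wide_Character'Val (16#3B3#));
begin
   Report ("ASCII", "Arial");   --  5 octets in UTF-8, 20 in UTF-32
   Report ("Greek", Greek);     --  6 octets in UTF-8, 12 in UTF-32
end Size_Demo;
```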

The main problem with UTF-8 is representing the code point positions in a
way that they (a) aren't abused and (b) don't cost too much to calculate.
Just using character indexes is too expensive for UTF-8 and UTF-16
representations, and using octet indexes is unsafe (since splitting a
character representation is a possibility). I'd probably use an abstract
character position type that was implemented with an octet index under the
covers.
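
One way such a position type could look, as a rough sketch only: every name
here (UTF8_Text, Position, First, Next, Has_Element, Element) is
hypothetical, the text is assumed to be well-formed UTF-8 held in an
ordinary String, and no validation is attempted. The point is that clients
can step from character to character but cannot do arithmetic on a
Position, because the octet index is hidden in the private part.

```ada
with Ada.Text_IO;

procedure Position_Demo is

   package UTF8_Text is
      subtype UTF8 is String;    --  octets, assumed to be well-formed UTF-8
      type Position is private;  --  opaque to clients: no "+" or "-" visible

      function First (Item : UTF8) return Position;
      function Has_Element (Item : UTF8; Pos : Position) return Boolean;
      function Next (Item : UTF8; Pos : Position) return Position;
      --  Octets of the character that starts at Pos.
      function Element (Item : UTF8; Pos : Position) return String;
   private
      type Position is new Integer;  --  an octet index under the covers
   end UTF8_Text;

   package body UTF8_Text is

      --  Number of octets in the sequence introduced by the octet Lead.
      --  Continuation or invalid octets are stepped over one at a time.
      function Sequence_Length (Lead : Character) return Positive is
        (case Lead is
            when Character'Val (16#C0#) .. Character'Val (16#DF#) => 2,
            when Character'Val (16#E0#) .. Character'Val (16#EF#) => 3,
            when Character'Val (16#F0#) .. Character'Val (16#F7#) => 4,
            when others => 1);

      function First (Item : UTF8) return Position is
        (Position (Item'First));

      function Has_Element (Item : UTF8; Pos : Position) return Boolean is
        (Integer (Pos) in Item'Range);

      function Next (Item : UTF8; Pos : Position) return Position is
        (Pos + Position (Sequence_Length (Item (Integer (Pos)))));

      function Element (Item : UTF8; Pos : Position) return String is
         Start : constant Integer := Integer (Pos);
         Stop  : constant Integer :=
           Integer'Min (Item'Last,
                        Start + Sequence_Length (Item (Start)) - 1);
      begin
         return Item (Start .. Stop);
      end Element;

   end UTF8_Text;

   use UTF8_Text;

   --  "cafe" with an acute accent on the e (octets C3 A9), then "!".
   Text  : constant UTF8 :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#) & "!";
   Pos   : Position := First (Text);
   Count : Natural  := 0;
begin
   while Has_Element (Text, Pos) loop
      Count := Count + 1;
      Pos   := Next (Text, Pos);   --  "Pos + 1" would not compile here
   end loop;

   --  Five characters stored in six octets.
   pragma Assert (Count = 5 and Text'Length = 6);
   Ada.Text_IO.Put_Line
     ("characters:" & Natural'Image (Count)
      & ", octets:" & Natural'Image (Text'Length));
end Position_Demo;
```

Getting to the Nth character still means walking the string, which is the
cost mentioned above; the abstraction only makes sure nobody lands in the
middle of a character by accident.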

I think that would work OK, as doing math on those is suspicious with a UTF
representation. We're spoiled from using Latin-1 representations, of course,
but generally one is interested in 5 characters, not 5 octets. And the
number of octets in 5 characters depends on the string. So most of the sorts
of operations that I tend to do would be a problem. For instance, from some
code I was fixing earlier today:

   if Font'Length > 6 and then
      Font(2..6) = "Arial" then

This would be a bad idea if one is using any sort of universal
representation -- you don't know how many octets are in the string literal,
so you can't assume a number in the test string. So the slice is dangerous
(even though in this particular case it would be OK since the test string is
all ASCII characters -- but I wouldn't want users to get in the habit of
assuming such things).

[BTW, the above was a bad idea anyway, because it turns out that the
function in the Ada library returned a string whose bounds don't start at 1.
So the slice was usually out of range -- which is why I was looking at the
code. Another thing that we could do without. Slices are evil, since they
*seem* to be the right solution, yet rarely are in practice without jumping
through a lot of hoops.]
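
A bounds-relative version of that kind of test might look like the sketch
below. Matches_At and the sample value of Font are hypothetical. It still
works on octets (type String), so it addresses only the bounds problem, not
the universal-representation one.

```ada
with Ada.Text_IO;

procedure Slice_Demo is

   --  Hypothetical helper: True if Pattern occurs in Item starting Offset
   --  elements after Item'First, whatever the bounds of Item are.
   function Matches_At
     (Item : String; Offset : Natural; Pattern : String) return Boolean is
   begin
      if Item'Length < Offset + Pattern'Length then
         return False;
      end if;
      return Item (Item'First + Offset ..
                   Item'First + Offset + Pattern'Length - 1) = Pattern;
   end Matches_At;

   --  Stand-in for a function result whose bounds do not start at 1.
   Font : constant String (5 .. 12) := " Arial 9";
begin
   --  Font (2 .. 6) would raise Constraint_Error here: 2 is not in 5 .. 12.
   if Matches_At (Font, Offset => 1, Pattern => "Arial") then
      Ada.Text_IO.Put_Line ("Arial font");
   end if;
end Slice_Demo;
```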
> The situation is comparable to Unbounded_Strings. The implementation is
> relatively simple, but the user must carry the burden of calling To_String
> and To_Unbounded_String all over the application and the processor must
> suffer the overhead of copying arrays here and there.

Yes, but that happens because Ada doesn't really have a string abstraction,
so when you try to build one, you can't fully do the job. One presumes that
a new language with a universal UTF-8 string wouldn't have that problem. (As
previously noted, I don't see much point in trying to patch up Ada with a
bunch of UTF-8 string packages; you would need an entirely new set of
Ada.Strings libraries and I/O libraries, and then you'd have all of the old
stuff messing up resolution, using the best names, and confusing everything.
A cleaner slate is needed.)

Randy.