From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=0.2 required=5.0 tests=BAYES_00,INVALID_MSGID,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,f868292008c639ce
X-Google-Attributes: gid103376,public
From: Florian Weimer <fw@deneb.cygnus.argh.org>
Subject: Re: C vs. Ada - strings
Date: 2000/05/05
Message-ID: <87k8h9v1iy.fsf@deneb.cygnus.argh.org>#1/1
X-Deja-AN: 619511415
References: <390F0D93.F835FAD9@ftw.rsc.raytheon.com>
Mail-Copies-To: never
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@cygnus.argh.org
X-Trace: deneb.cygnus.argh.org 957515653 11861 192.168.1.2 (5 May 2000
 08:34:13 GMT)
Organization: Penguin on board
User-Agent: Gnus/5.0806 (Gnus v5.8.6) Emacs/20.6
Mime-Version: 1.0
Reply-To: Florian Weimer <fw-usenet@deneb.cygnus.argh.org>
NNTP-Posting-Date: 5 May 2000 08:34:13 GMT
Newsgroups: comp.lang.ada
Date: 2000-05-05T08:34:13+00:00
List-Id: <comp.lang.ada>

Wes Groleau <wwgrol@ftw.rsc.raytheon.com> writes:

> Two offices adjoining mine are occupied by persons
> fond of saying "Ada strings suck"

I've just written a parser for a small, regular language -- in C.  For
this kind of job, C strings are quite handy, and I even think that
the code is easy to read (it reuses the same idiom many times: keep
a pointer to the beginning of the token, iterate over the token and
replace the character delimiting it from the next token by a '\0', and
finally skip additional delimiters).  Of course, this only works if
you're dealing with text strings; if you are dealing with binary data,
you can't use in-band signalling of string terminators.
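
As a rough sketch of the idiom described above (not the actual parser
code, which is not shown here): the function name split_in_place and
its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Splits `buf` in place: the delimiter that ends
 * each token is overwritten with '\0', and a pointer to the start of
 * each token is stored in `tokens`. Returns the number of tokens found
 * (at most `max_tokens`). */
static size_t split_in_place(char *buf, const char *delims,
                             char **tokens, size_t max_tokens)
{
    size_t count = 0;
    char *p = buf;

    assert(buf != NULL && delims != NULL && tokens != NULL);

    p += strspn(p, delims);          /* skip leading delimiters */
    while (*p != '\0' && count < max_tokens) {
        tokens[count++] = p;         /* pointer to the beginning of the token */
        p += strcspn(p, delims);     /* iterate over the token */
        if (*p == '\0')
            break;                   /* token runs to the end of the input */
        *p++ = '\0';                 /* replace the delimiter by '\0' */
        p += strspn(p, delims);      /* skip additional delimiters */
    }
    return count;
}
```

Called on a writable buffer holding "  alpha beta,,gamma " with the
delimiters " ,", this yields the three tokens "alpha", "beta" and
"gamma".  The buffer itself is modified, which is exactly why the
trick fails for data that may contain '\0' bytes.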

A direct Ada translation would be a bit more complicated because you
would have to keep track both of the start and the end of the tokens.
Copying the token to a different, unbounded string variable is perhaps
a translation which is more appropriate and even more readable than
the C solution.  Obviously, you can't do this in standard C because
there are no unbounded strings.  Each time you have to create strings
whose maximum length is not given at compile time, you have to use
heap allocation and worry about all the consequences.  I find this
rather unacceptable because unbounded strings are quite common.
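
For comparison, here is a sketch in C of the "start and end" style,
followed by a heap copy of the token.  The names slice, next_token
and copy_token are assumptions made for this example; the caller has
to remember to free() every copy, which is the kind of consequence
meant above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: a token given by its start and length, so the
 * input is never modified and no terminator is needed. */
struct slice {
    const char *start;
    size_t len;
};

/* memchr is used instead of strchr so that a '\0' byte in the data is
 * not mistaken for a delimiter. */
static int is_delim(char c, const char *delims, size_t ndelims)
{
    return memchr(delims, c, ndelims) != NULL;
}

/* Scans [*cursor, end) for the next token and advances *cursor past it.
 * Returns a slice with len == 0 when no token is left. */
static struct slice next_token(const char **cursor, const char *end,
                               const char *delims)
{
    const char *p = *cursor;
    size_t ndelims = strlen(delims);
    struct slice tok;

    assert(p <= end);
    while (p < end && is_delim(*p, delims, ndelims))
        p++;                         /* skip delimiters before the token */
    tok.start = p;
    while (p < end && !is_delim(*p, delims, ndelims))
        p++;                         /* find the end of the token */
    tok.len = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}

/* Copies a token to the heap as a '\0'-terminated string.
 * Returns NULL on allocation failure; the caller owns the result. */
static char *copy_token(struct slice tok)
{
    char *copy = malloc(tok.len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, tok.start, tok.len);
    copy[tok.len] = '\0';
    return copy;
}
```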

> I've had to twice write packages similar to the Ada 95
> string packages to avoid imitating other folks who
> continually re-invent the same string handling logic
> over and over.

Most probably, I'll write my own string package some day, but entirely
due to efficiency considerations.  In fact, I'm going to mimic the
standard Ada interface as closely as possible.

I have only worked with the GNAT implementation of the standard Ada
strings, and two things annoy me particularly: First, the bounded
strings tend to increase code size and compile time considerably.  The
string package Gautier mentioned could be used as a replacement in
places where this is a concern, and perhaps the bounded strings can be
implemented on top of it, reducing code bloat.

Second, the unbounded strings are inefficient to such a degree that it
starts to irritate people.  (There even was a thread on this topic
a few weeks ago.)  A reference-count based implementation could
be considerably faster: you could preallocate storage so that you
don't have to use an allocator and copy the entire string each time
you add a character to the string, you could take into account that
storage is allocated in chunks of certain sizes (for example, the
smallest data chunk allocated by GNU malloc is 12 bytes large on
32 bit platforms), you don't have to use allocators constantly if
you pass around strings, and so on.  I'm sure such an implementation
will greatly increase overall performance, although it is much more
complicated than the current one.
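
To make the idea a bit more concrete, here is a minimal sketch in C
(chosen only because the thread compares the two languages; this is
not GNAT's implementation, and every name in it is hypothetical).
Storage is preallocated and doubled when full, so appending a
character rarely reallocates, and sharing a string only bumps a
counter.  The counter is a plain integer here, so this version is not
task-safe, and overflow of the capacity is not checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical representation of a reference-counted string. */
struct rc_string {
    unsigned refs;   /* number of owners (plain integer, not task-safe) */
    size_t len;      /* characters in use */
    size_t cap;      /* characters allocated */
    char *data;
};

/* Creates an empty string with preallocated storage, or NULL on failure. */
static struct rc_string *rc_new(size_t initial_cap)
{
    struct rc_string *s = malloc(sizeof *s);

    if (s == NULL)
        return NULL;
    s->cap = initial_cap > 0 ? initial_cap : 16;
    s->data = malloc(s->cap);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    s->refs = 1;
    s->len = 0;
    return s;
}

/* Passing the string around only increments the counter; no copy. */
static struct rc_string *rc_share(struct rc_string *s)
{
    s->refs++;
    return s;
}

/* Drops one reference and frees the storage when it was the last one. */
static void rc_release(struct rc_string *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0) {
        free(s->data);
        free(s);
    }
}

/* Appends one character; the caller must be the only owner.
 * Returns 0 on success, -1 on allocation failure. */
static int rc_append(struct rc_string *s, char c)
{
    assert(s->refs == 1);
    if (s->len == s->cap) {
        size_t new_cap = s->cap * 2;          /* grow geometrically */
        char *p = realloc(s->data, new_cap);

        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = new_cap;
    }
    s->data[s->len++] = c;
    return 0;
}
```

Starting from a capacity of 4, appending ten characters reallocates
only twice (to 8 and then to 16) instead of ten times.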

Unfortunately, it is difficult to exactly duplicate the semantics
of the standard Ada unbounded strings.  Standard Ada strings are
immutable, but with some reference count tricks, you can even do
in-place modification without losing immutability.  Another issue is
perhaps more problematic: At first, reference counts are not task-safe
at all.  But a task-safe implementation does not require extensive
locking if the hardware provides atomic load-increment-store and
load-decrement-store-compare-to-zero operations on integers of a
suitable size (e.g., 32 bits and more).  The x86 architecture does
have these operations (and for this kind of application, they are even
completely SMP-safe), but these instructions look very CISCy, so they
are probably not available on other architectures.
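
The two operations can be sketched with the atomics from C11's
<stdatomic.h>.  That header is an assumption of this example (it
postdates this discussion), and the names refcount, refcount_retain
and so on are hypothetical.  On x86, compilers typically turn these
calls into lock-prefixed instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter type for a shared string. */
struct refcount {
    atomic_uint count;
};

/* A freshly created string has exactly one owner. */
static void refcount_init(struct refcount *r)
{
    atomic_init(&r->count, 1u);
}

/* Atomic load-increment-store: take an additional reference. */
static void refcount_retain(struct refcount *r)
{
    unsigned old = atomic_fetch_add_explicit(&r->count, 1u,
                                             memory_order_relaxed);
    assert(old > 0);   /* retaining an already freed string is a bug */
    (void)old;         /* silence "unused" warnings under NDEBUG */
}

/* Atomic load-decrement-store-compare-to-zero: drop a reference.
 * Returns true if this was the last one, so the caller must free. */
static bool refcount_release(struct refcount *r)
{
    unsigned old = atomic_fetch_sub_explicit(&r->count, 1u,
                                             memory_order_acq_rel);
    assert(old > 0);
    return old == 1;
}

/* True if the caller holds the only reference, in which case an
 * in-place modification cannot be observed by anyone else. */
static bool refcount_is_unique(struct refcount *r)
{
    return atomic_load_explicit(&r->count, memory_order_acquire) == 1;
}
```

No lock is taken anywhere; whichever task sees refcount_release
return true is the one that frees the storage.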

At the moment, I'm not sure if mimicking the Ada semantics of
unbounded strings is worth the trouble at all.  Perhaps it's better to
make the strings with reference counts mutable.  (Immutable strings
which aren't safe for tasking aren't an option.  I'm sure people tend
to forget that they're unsafe, and the resulting bugs are horrible to
debug.)