From: Markus E Leypold
Newsgroups: comp.lang.ada
Subject: Re: Reading and writing a big file in Ada (GNAT) on Windows XP
Date: Sat, 05 May 2007 19:27:45 +0200
Message-ID: <9qhcqr5gfi.fsf@hod.lan.m-e-leypold.de>
References: <0hj5339mjmond132qhbn2o01unurs61lbj@4ax.com>
 <1178091967.392381.282510@o5g2000hsb.googlegroups.com>
 <1178224048.034635.39010@l77g2000hsb.googlegroups.com>
 <3eabwlo2ip.fsf@hod.lan.m-e-leypold.de>
 <1178382400.746586.274530@y5g2000hsa.googlegroups.com>

Adam Beneschan writes:

> On May 3, 5:28 pm, Markus E Leypold wrote:
>> "Randy Brukardt" writes:
>> > "Adam Beneschan" wrote in message
>> > news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
>> > ...
>> >> It strikes me that Index is the kind of function that really
>> >> ought to be written in assembly language, at least partially.  I
>> >> notice that the version of Linux that I'm using has a built-in
>> >> function to search memory for a substring; this is very
>> >> descriptively called memmem() and has the amusing profile
>> >>
>> >>   void *memmem (const void *needle, size_t needlelen,
>> >>                 const void *haystack, size_t haystacklen);
>> >>
>> >> according to the man page.  But I assume this is written to use
>> >> registers optimally and take advantage of the REP instructions
>> >> (on an x86 or Pentium).  I don't know how GNAT implements
>> >> Index---I haven't looked into it.
>> >
>> > The big expense in Index is the mapping set or function, not the
>> > actual compare.  For Janus/Ada, I had seen a similar problem (a
>> > big deal, as Index was used to look for spam patterns), and
>> > finally special-cased a number of common cases (no mapping,
>> > single-character patterns, and so on).  I also spent a bit of
>> > time on the code generator, figuring that this sort of string
>> > manipulation code is common enough that it might as well be
>> > generated well.  The updates helped a lot, although they don't
>> > quite generate a single instruction such as is possible.  (OTOH,
>> > Intel used to recommend avoiding the block move and compare
>> > instructions because they fouled up the pipeline and thus slowed
>> > the overall execution.  I don't know if that is still true, but
>> > if it is, there might be less benefit to hand-coded assembler
>> > than you are thinking...)
>>
>> I'd like to add that some years ago I read a rather well-written
>> paper about this kind of text search.  Just repeating comparisons
>> in two loops (one to move the offset in haystack, the other to
>> compare needle against haystack+offset) is actually not the best
>> method.  I.e.
>> if one fails when comparing this
>>
>> haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd
>> needle:                                adasdqwsdfklsdf
>>                                               ^
>>                                        comparison fails
>>
>> one can immediately increase the offset by 7, since the needle does
>> not contain a '3' in the part that lies before the point where the
>> comparison failed.  The technique I read about used considerations
>> of this kind to compile the needle into an automaton which could
>> skip more than one character at a time in haystack, and then
>> executed / interpreted this automaton.  (Actually the algorithm was
>> a bit more tricky than this, I think: if the automaton sees a
>> character that does not occur in needle at all, it can immediately
>> skip by the length of needle, and I think part of the trick was to
>> start comparing from the end of needle.)
>>
>> So in this case a better algorithm is probably the way to go (and I
>> agree, assembler and REP, which would just be the naive algorithm
>> with two loops, won't cut it; at least I'd not bet on it).

> A big problem, however, is that while such "well-written" papers may
> be mathematically correct, they must make assumptions that aren't
> necessarily true in real life.  For instance, analyses that I've read

While that is certainly true, I don't think it applies in this case.
The algorithm I'm referring to, and which I'm too lazy to look up,
seems to be rather widely used when looking for a given substring in
a larger string or text, like when searching or search-and-replacing
in a text editor.

> on searching algorithms show that this or that algorithm may
> minimize the number of times character comparisons are performed.
> But in real life, not all character comparisons have the same cost.
> If a processor has a string comparison instruction, for instance,
> then presumably the hardware has been optimized so that the
> characters will be compared significantly faster than if the program
> did all the index register manipulation itself and compared
> individual characters.

The processor hardware can only search for strings in a "dumb" way,
whereas the algorithm I have been talking about can be much smarter.
Of course I'd expect algorithms like that to have the best (and most
predictable) performance on RISC machines.  And it might well pay to
implement the "interpreter" in assembly, too.

> Algorithmic analysis, or at least the kind I've seen, usually
> doesn't take this sort of thing into account.  If one
> pattern-matching

Isn't that some kind of sweeping generalization?

> algorithm can be shown to be better than another, then (for a
> particular processor) there should be some values M and N such that
> if the source length is >M and the pattern length is >N, the
> mathematically faster algorithm actually does run faster---but that
> doesn't help if your real-life inputs are likely to have lengths
> significantly less than that.  The original poster had a pattern
> with a quite small length, for instance.  A technique in which you
> precompile the pattern into an automaton, for instance, is likely to
> make things slower when you're using such a small pattern and
> calling the Index routine thousands or millions of times.

Well -- I'm not trying to peddle that algorithm to you or anybody
else.  Indeed I now wish I'd stayed silent instead.  Originally I
intended to point people who are trying to find substrings to the
fact that there are much smarter algorithms available which can skip
a significant percentage of character comparisons.
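For reference: the behaviour described above -- comparing from the
end of the needle, and skipping by the whole needle length on
characters that do not occur in it -- matches the Boyer-Moore family
of search algorithms; the simplest practical variant is usually
called Boyer-Moore-Horspool.  What follows is a minimal sketch of
that variant in Ada, as an illustration of the technique only, not
GNAT's actual Index implementation; the name Search and the
0-for-no-match convention (borrowed from Ada.Strings.Fixed.Index)
are just choices made for the example.

--  Boyer-Moore-Horspool sketch: returns the index of the first
--  occurrence of Needle within Haystack, or 0 if there is none.
function Search (Haystack, Needle : String) return Natural is
begin
   if Needle'Length = 0 or else Needle'Length > Haystack'Length then
      return 0;
   end if;

   declare
      --  Shift table: how far the window may jump when a given
      --  character sits under the needle's last position.  By
      --  default (character not in needle) that is the full needle
      --  length.
      Skip : array (Character) of Positive :=
               (others => Needle'Length);
      Len  : constant Positive := Needle'Length;
      Pos  : Natural := Haystack'First;
   begin
      --  Characters occurring in the needle (except its last one)
      --  only allow a jump equal to their distance to the needle's
      --  end.
      for I in Needle'First .. Needle'Last - 1 loop
         Skip (Needle (I)) := Needle'Last - I;
      end loop;

      while Pos + Len - 1 <= Haystack'Last loop
         declare
            J : Natural := Len;
         begin
            --  Compare from the end of the needle backwards.
            while J > 0
              and then Haystack (Pos + J - 1) =
                       Needle (Needle'First + J - 1)
            loop
               J := J - 1;
            end loop;

            if J = 0 then
               return Pos;   --  Match found at Pos.
            end if;

            --  Jump by the shift value of the haystack character
            --  currently under the needle's last position.
            Pos := Pos + Skip (Haystack (Pos + Len - 1));
         end;
      end loop;

      return 0;   --  No match.
   end;
end Search;

With the haystack/needle example above, every window whose last
haystack character does not occur in the needle at all moves the
search forward by the whole needle length in a single step; that is
where the rough 1/N estimate below comes from.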
Whether this is faster than brute-force dumb comparison would have to
be decided in each single case.  Ideally, though, in the case of a
substring of N characters, when a single character comparison fails,
the offset can be skipped by a K on the order of magnitude of N; that
is, if you look for a substring of N characters, you might only have
to do something on the order of 1/N of the comparisons you'd have to
do in the dumb implementation.  Admittedly things are more
complicated than that, but I think one can see that the point where
it becomes as fast as the dumb case could be reached even for small
N, especially if the interpreter is hand-written assembler or
generated by a really good code generator.

As I said: I'm not trying to peddle that algorithm to anybody, so I'm
not inclined to look it up or try a better analysis.  I have better
things to do, since _I_ don't have a problem with searching for
substrings at the moment.

> This doesn't necessarily help write the Index routine in the best
> way.  In this case, an implementation *could* look at the lengths of
> the strings and make a judgment as to what the best algorithm might
> be (the implementor may have to experiment on each supported
> processor, though, to determine the correct cutoffs).  But I think
> it would be a mistake for an implementor to write Index in a way
> that uses some "optimal" algorithm that makes things better only
> when the argument sizes are >1000, say, and makes things slower when
> the sizes are less than that.

If I understood the "paper" right (actually it was only some dumb
article in some applied programming magazine), the algorithm had been
designed for simple word searches, i.e. substrings of around 10
characters.

Regards -- Markus