Reading and writing a big file in Ada (GNAT) on Windows XP

comp.lang.ada
 help / color / mirror / Atom feed

* Reading and writing a big file in Ada (GNAT) on Windows XP
@ 2007-04-28  5:03 Fionn Mac Cumhaill
  2007-04-28  5:20 ` Gautier
                   ` (7 more replies)
  0 siblings, 8 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-28  5:03 UTC (permalink / raw)


I have written a very simple program that runs  on a Windows XP
machine. (Pentium D , I don't remember the clock speed, it is in
middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the
MinGW GNAT 3.4.5 Ada compiler.

All it does is read lines in a loop from a text file with
Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
lines that it reads, and writes the lines to an output file with
Put_Line. 

The modifications consist of replacing a slice of text at the end of a
line with another bit of text. The biggest slice is 10 characters, and
the replacement slice is always smaller than the original slice. An
occasional line of text is about 6000 characters long, but most are
about 700 haracters.  Get_Line reads them into a String variable that
is 10,000 characters long.


The problem is that the input file has more than 10 million lines of
text in it. The program works perfectly, but takes about 5 hours to
run. The Cygwin version of wc can count the lines in the input file in
less than one minute.

The program that produces the file pulls data from a SQL Server 7
running on an 800 MhZ Pentium III machine with 512 Mb and writes it to
a file in 1-1/2 hours. Almost all (99.9%) of the lines are rows from a
database table.

Why is this so slow? 
Do I have an Ada problem, a GNAT problem, or a MinGW problem?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
@ 2007-04-28  5:20 ` Gautier
  2007-04-29 21:17   ` Fionn Mac Cumhaill
  2007-04-28  5:25 ` tmoran
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Gautier @ 2007-04-28  5:20 UTC (permalink / raw)


Fionn Mac Cumhaill:
What is the time when you just rewrite the lines (no modification) ?
Perhaps there is (also) a bottleneck (like a search in a search) in the 
modification algorithm.
At least you can know how much time it takes.
HTH
______________________________________________________________
Gautier         -- http://www.mysunrise.ch/users/gdm/index.htm
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm

NB: For a direct answer, e-mail address on the Web site!



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:20 ` Gautier
@ 2007-04-29 21:17   ` Fionn Mac Cumhaill
  0 siblings, 0 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-29 21:17 UTC (permalink / raw)


On Sat, 28 Apr 2007 07:20:03 +0200, Gautier <gautier@fakeaddress.nil>
wrote:

>Fionn Mac Cumhaill:
>What is the time when you just rewrite the lines (no modification) ?
>Perhaps there is (also) a bottleneck (like a search in a search) in the 
>modification algorithm.
>At least you can know how much time it takes.
>HTH
>______________________________________________________________
>Gautier         -- http://www.mysunrise.ch/users/gdm/index.htm
>Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm
>
I'll try it with no modifications, but I would be surprised if the
modification itself takes up much time.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
  2007-04-28  5:20 ` Gautier
@ 2007-04-28  5:25 ` tmoran
  2007-04-28  6:56 ` Martin Krischik
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: tmoran @ 2007-04-28  5:25 UTC (permalink / raw)


>The problem is that the input file has more than 10 million lines of
>text in it. The program works perfectly, but takes about 5 hours to
>run. The Cygwin version of wc can count the lines in the input file in
  Is your problem in Text_IO or in the scanning & "modifications"
code?  If your program does no scanning and makes no modifications how
long does it take?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
  2007-04-28  5:20 ` Gautier
  2007-04-28  5:25 ` tmoran
@ 2007-04-28  6:56 ` Martin Krischik
  2007-04-28 17:12   ` tmoran
  2007-04-28 12:41 ` Jeffrey Creem
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Martin Krischik @ 2007-04-28  6:56 UTC (permalink / raw)

Fionn Mac Cumhaill wrote:

> All it does is read lines in a loop from a text file with
> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
> lines that it reads, and writes the lines to an output file with
> Put_Line.

I wrote something like as well. Only I did it genericly as part of the AdaCL
[1] library. It could be extended to solve your problem.

Now, what I did was to create pipline with three threads - one to read the
line, one to make the change, and one to write the lines. Each step of the
pipline where connected with a fifo buffer. The idea was that while the
system is waiting for new data from the disk it could convert an few lines
from the buffer.

When I wrote it I use SCSI drives so I could expect better performance from
such a setup. But you might still give it a try. I used used pretty modest
buffering - in your case I would go for bigger buffers.

Martin

[1] http://adacl.sf.net
-- 
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  6:56 ` Martin Krischik
@ 2007-04-28 17:12   ` tmoran
  0 siblings, 0 replies; 45+ messages in thread
From: tmoran @ 2007-04-28 17:12 UTC (permalink / raw)


>Now, what I did was to create pipline with three threads - one to read the
>line, one to make the change, and one to write the lines. Each step of the
   If the scanning&modification is taking a lot of CPU time and you have
a multicore CPU, it might also be worthwhile to have more than one "make
the change" tasks.  Experiment shows that helps on a simple word counter.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
                   ` (2 preceding siblings ...)
  2007-04-28  6:56 ` Martin Krischik
@ 2007-04-28 12:41 ` Jeffrey Creem
  2007-04-29 21:35   ` Fionn Mac Cumhaill
  2007-04-28 13:22 ` (see below)
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 45+ messages in thread
From: Jeffrey Creem @ 2007-04-28 12:41 UTC (permalink / raw)


Fionn Mac Cumhaill wrote:
> I have written a very simple program that runs  on a Windows XP
> machine. (Pentium D , I don't remember the clock speed, it is in
> middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the
> MinGW GNAT 3.4.5 Ada compiler.
> 
> All it does is read lines in a loop from a text file with
> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
> lines that it reads, and writes the lines to an output file with
> Put_Line. 
> 
> The modifications consist of replacing a slice of text at the end of a
> line with another bit of text. The biggest slice is 10 characters, and
> the replacement slice is always smaller than the original slice. An
> occasional line of text is about 6000 characters long, but most are
> about 700 haracters.  Get_Line reads them into a String variable that
> is 10,000 characters long.
> 
> 
> The problem is that the input file has more than 10 million lines of
> text in it. The program works perfectly, but takes about 5 hours to
> run. The Cygwin version of wc can count the lines in the input file in
> less than one minute.
> 
> The program that produces the file pulls data from a SQL Server 7
> running on an 800 MhZ Pentium III machine with 512 Mb and writes it to
> a file in 1-1/2 hours. Almost all (99.9%) of the lines are rows from a
> database table.
> 
> Why is this so slow? 
> Do I have an Ada problem, a GNAT problem, or a MinGW problem?

Is there any chance that the code is doing things like initializing the 
string to blanks/nulls before each getline and or scanning beyond the 
value returned for last when doing the replace?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 12:41 ` Jeffrey Creem
@ 2007-04-29 21:35   ` Fionn Mac Cumhaill
  0 siblings, 0 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-29 21:35 UTC (permalink / raw)


On Sat, 28 Apr 2007 08:41:26 -0400, Jeffrey Creem <jeff@thecreems.com>
wrote:

>Fionn Mac Cumhaill wrote:
>> I have written a very simple program
>> Do I have an Ada problem, a GNAT problem, or a MinGW problem?
  ... snip ...
>
>Is there any chance that the code is doing things like initializing the 
>string to blanks/nulls before each getline and or scanning beyond the 
>value returned for last when doing the replace?

It does indeed initialize the string buffer. I put that in a a quick
bug fix, but the version without the initialization still took 5
hours.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
                   ` (3 preceding siblings ...)
  2007-04-28 12:41 ` Jeffrey Creem
@ 2007-04-28 13:22 ` (see below)
  2007-04-28 17:56 ` Simon Wright
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 45+ messages in thread
From: (see below) @ 2007-04-28 13:22 UTC (permalink / raw)


On 28/4/07 06:03, in article 0hj5339mjmond132qhbn2o01unurs61lbj@4ax.com,
"Fionn Mac Cumhaill" <invisible@hiding.from.spam> wrote:

> I have written a very simple program that runs  on a Windows XP
> machine. (Pentium D , I don't remember the clock speed, it is in
> middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the
> MinGW GNAT 3.4.5 Ada compiler.
> 
> All it does is read lines in a loop from a text file with
> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
> lines that it reads, and writes the lines to an output file with
> Put_Line. 
...
> The problem is that the input file has more than 10 million lines of
> text in it. The program works perfectly, but takes about 5 hours to
> run. The Cygwin version of wc can count the lines in the input file in
> less than one minute.

I have an assembler that reads source code and writes an equal number of
lines of object code, all I/O done using Ada.Text_IO.{Get_Line, Put(str)}.

That takes less than 6 sec CPU time, on a Mac Powerbook 1.67GHz laptop,
for 1_000_000 lines of input, giving a performance of around 10 million
lines per minute - around 300 times faster than you are getting - so I think
it unlikely that Ada.Text_IO is your problem.

-- 
Bill Findlay
<surname><forename> chez blueyonder.co.uk





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
                   ` (4 preceding siblings ...)
  2007-04-28 13:22 ` (see below)
@ 2007-04-28 17:56 ` Simon Wright
  2007-04-28 18:28   ` Jeffrey Creem
  2007-04-29 21:42   ` Fionn Mac Cumhaill
  2007-04-28 19:12 ` Jeffrey R. Carter
  2007-05-02  7:46 ` george
  7 siblings, 2 replies; 45+ messages in thread
From: Simon Wright @ 2007-04-28 17:56 UTC (permalink / raw)


Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:

> The modifications consist of replacing a slice of text at the end of
> a line with another bit of text. The biggest slice is 10 characters,
> and the replacement slice is always smaller than the original
> slice. An occasional line of text is about 6000 characters long, but
> most are about 700 haracters.  Get_Line reads them into a String
> variable that is 10,000 characters long.

Yesterday I found a performance (and stack!) related problem in GNAT's
Ada.Strings.Fixed.Index. The string is mapped onto the stack before
the search is done. But I wouldn't have thought this would result in a
5 hour run time. How do you decide which lines need conversion?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 17:56 ` Simon Wright
@ 2007-04-28 18:28   ` Jeffrey Creem
  2007-04-29  7:20     ` Simon Wright
  2007-04-29 21:44     ` Fionn Mac Cumhaill
  2007-04-29 21:42   ` Fionn Mac Cumhaill
  1 sibling, 2 replies; 45+ messages in thread
From: Jeffrey Creem @ 2007-04-28 18:28 UTC (permalink / raw)


Simon Wright wrote:
> Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:
> 
>> The modifications consist of replacing a slice of text at the end of
>> a line with another bit of text. The biggest slice is 10 characters,
>> and the replacement slice is always smaller than the original
>> slice. An occasional line of text is about 6000 characters long, but
>> most are about 700 haracters.  Get_Line reads them into a String
>> variable that is 10,000 characters long.
> 
> Yesterday I found a performance (and stack!) related problem in GNAT's
> Ada.Strings.Fixed.Index. The string is mapped onto the stack before
> the search is done. But I wouldn't have thought this would result in a
> 5 hour run time. How do you decide which lines need conversion?

I agree, it does not seem that either that or the guesses I had related 
to zeroing at the string (a common mistake) would result in the level of 
slowdown being seen...However, now that you mention it, I see a few ugly 
things like that related to Ada.Strings.Fixed. Did you submit a bug 
report to FSF GCC?


As for the original poster, any chance you are doing dynamic memory 
allocation and leaking and perhaps causing swapping?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 18:28   ` Jeffrey Creem
@ 2007-04-29  7:20     ` Simon Wright
  2007-04-29 21:44     ` Fionn Mac Cumhaill
  1 sibling, 0 replies; 45+ messages in thread
From: Simon Wright @ 2007-04-29  7:20 UTC (permalink / raw)


Jeffrey Creem <jeff@thecreems.com> writes:

> Simon Wright wrote:
>> Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:
>>
>>> The modifications consist of replacing a slice of text at the end of
>>> a line with another bit of text. The biggest slice is 10 characters,
>>> and the replacement slice is always smaller than the original
>>> slice. An occasional line of text is about 6000 characters long, but
>>> most are about 700 haracters.  Get_Line reads them into a String
>>> variable that is 10,000 characters long.
>>
>> Yesterday I found a performance (and stack!) related problem in GNAT's
>> Ada.Strings.Fixed.Index. The string is mapped onto the stack before
>> the search is done. But I wouldn't have thought this would result in a
>> 5 hour run time. How do you decide which lines need conversion?
>
> I agree, it does not seem that either that or the guesses I had
> related to zeroing at the string (a common mistake) would result in
> the level of slowdown being seen...However, now that you mention it, I
> see a few ugly things like that related to Ada.Strings.Fixed. Did you
> submit a bug report to FSF GCC?

No, to AdaCore.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 18:28   ` Jeffrey Creem
  2007-04-29  7:20     ` Simon Wright
@ 2007-04-29 21:44     ` Fionn Mac Cumhaill
  1 sibling, 0 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-29 21:44 UTC (permalink / raw)


On Sat, 28 Apr 2007 14:28:56 -0400, Jeffrey Creem <jeff@thecreems.com>
wrote:

>Simon Wright wrote:
>> Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:
>> 
>>> The modifications consist of replacing a slice of text at the end of
>>> a line with another bit of text. The biggest slice is 10 characters,
>>> and the replacement slice is always smaller than the original
>>> slice. An occasional line of text is about 6000 characters long, but
>>> most are about 700 haracters.  Get_Line reads them into a String
>>> variable that is 10,000 characters long.
>> 
>> Yesterday I found a performance (and stack!) related problem in GNAT's
>> Ada.Strings.Fixed.Index. The string is mapped onto the stack before
>> the search is done. But I wouldn't have thought this would result in a
>> 5 hour run time. How do you decide which lines need conversion?
>
>I agree, it does not seem that either that or the guesses I had related 
>to zeroing at the string (a common mistake) would result in the level of 
>slowdown being seen...However, now that you mention it, I see a few ugly 
>things like that related to Ada.Strings.Fixed. Did you submit a bug 
>report to FSF GCC?
>
>
>As for the original poster, any chance you are doing dynamic memory 
>allocation and leaking and perhaps causing swapping?

I didn't see any swapfile activity at all.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 17:56 ` Simon Wright
  2007-04-28 18:28   ` Jeffrey Creem
@ 2007-04-29 21:42   ` Fionn Mac Cumhaill
  2007-04-30  0:48     ` Jeffrey R. Carter
  1 sibling, 1 reply; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-29 21:42 UTC (permalink / raw)


On Sat, 28 Apr 2007 18:56:47 +0100, Simon Wright
<simon.j.wright@mac.com> wrote:

>Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:
>
>> The modifications consist of replacing a slice of text at the end of
>> a line with another bit of text. The biggest slice is 10 characters,
>> and the replacement slice is always smaller than the original
>> slice. An occasional line of text is about 6000 characters long, but
>> most are about 700 haracters.  Get_Line reads them into a String
>> variable that is 10,000 characters long.
>
>Yesterday I found a performance (and stack!) related problem in GNAT's
>Ada.Strings.Fixed.Index. The string is mapped onto the stack before
>the search is done. But I wouldn't have thought this would result in a
>5 hour run time. How do you decide which lines need conversion?

I use Index to search for the string that needs to be replaced. if the
returned value is not zero, that line needs to be changed. In 99.99%
of the changes, I look for ", ''),"  and replace it with "),"



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-29 21:42   ` Fionn Mac Cumhaill
@ 2007-04-30  0:48     ` Jeffrey R. Carter
  2007-04-30  2:30       ` Fionn Mac Cumhaill
  0 siblings, 1 reply; 45+ messages in thread
From: Jeffrey R. Carter @ 2007-04-30  0:48 UTC (permalink / raw)


Fionn Mac Cumhaill wrote:
> 
> I use Index to search for the string that needs to be replaced. if the
> returned value is not zero, that line needs to be changed. In 99.99%
> of the changes, I look for ", ''),"  and replace it with "),"

In a previous post, you said the pattern is found near the end of the 
string. Can you use the "Going => Backward" option to Index, and if so, 
do you?

-- 
Jeff Carter
"You've got the brain of a four-year-old boy,
and I bet he was glad to get rid of it."
Horse Feathers
47



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-30  0:48     ` Jeffrey R. Carter
@ 2007-04-30  2:30       ` Fionn Mac Cumhaill
  2007-04-30  4:21         ` tmoran
  0 siblings, 1 reply; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-30  2:30 UTC (permalink / raw)


On Mon, 30 Apr 2007 00:48:36 GMT, "Jeffrey R. Carter"
<jrcarter@acm.org> wrote:

>Fionn Mac Cumhaill wrote:
>> 
>> I use Index to search for the string that needs to be replaced. if the
>> returned value is not zero, that line needs to be changed. In 99.99%
>> of the changes, I look for ", ''),"  and replace it with "),"
>
>In a previous post, you said the pattern is found near the end of the 
>string. Can you use the "Going => Backward" option to Index, and if so, 
>do you?

I didn't try that, but I will when I get back to work.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-30  2:30       ` Fionn Mac Cumhaill
@ 2007-04-30  4:21         ` tmoran
  0 siblings, 0 replies; 45+ messages in thread
From: tmoran @ 2007-04-30  4:21 UTC (permalink / raw)


   Using Gnat 3.15p my Windows 2000 3.0GHz machine with an older disk
wrote 10K 700 character lines in 0.14 sec.  It read them back, doing
Ada.Strings.Fixed.Index for "b" (which they did not contain) in 1.5
seconds.  That should translate to 1650 seconds or 27.5 minutes to do a
read+scan+write on 10M such lines.
  Skipping the IO and just doing the scan 10K times took 0.4 seconds.
Replacing the Ada.Strings.Fixed.Index with a loop searching for
'b' took 0.02 seconds.
  So, if your program takes 5 hours something seriously unpleasant
is going on.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
                   ` (5 preceding siblings ...)
  2007-04-28 17:56 ` Simon Wright
@ 2007-04-28 19:12 ` Jeffrey R. Carter
  2007-04-29 21:46   ` Fionn Mac Cumhaill
  2007-05-02  7:46 ` george
  7 siblings, 1 reply; 45+ messages in thread
From: Jeffrey R. Carter @ 2007-04-28 19:12 UTC (permalink / raw)


Fionn Mac Cumhaill wrote:
> 
> All it does is read lines in a loop from a text file with
> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
> lines that it reads, and writes the lines to an output file with
> Put_Line. 
> 
> The modifications consist of replacing a slice of text at the end of a
> line with another bit of text. The biggest slice is 10 characters, and
> the replacement slice is always smaller than the original slice. An
> occasional line of text is about 6000 characters long, but most are
> about 700 haracters.  Get_Line reads them into a String variable that
> is 10,000 characters long.
> 
> The problem is that the input file has more than 10 million lines of
> text in it. The program works perfectly, but takes about 5 hours to
> run. The Cygwin version of wc can count the lines in the input file in
> less than one minute.
> 
> Why is this so slow? 
> Do I have an Ada problem, a GNAT problem, or a MinGW problem?

It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, 
but I'd be surprised if it made that much difference.

This sounds as if it would be a pretty simple program; if so, you could 
post it (or a reasonable facsimile, if there's a reason you can't post 
the actual code) here, and we would be better able to help.

-- 
Jeff Carter
"Nobody expects the Spanish Inquisition!"
Monty Python's Flying Circus
22



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28 19:12 ` Jeffrey R. Carter
@ 2007-04-29 21:46   ` Fionn Mac Cumhaill
  2007-05-01 14:10     ` Fionn Mac Cumhaill
  2007-05-06 21:55     ` Quarc
  0 siblings, 2 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-04-29 21:46 UTC (permalink / raw)


On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter"
<jrcarter@acm.org> wrote:

>Fionn Mac Cumhaill wrote:
>> 
>> All it does is read lines in a loop from a text file with
>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
>> lines that it reads, and writes the lines to an output file with
>> Put_Line. 
>> 
>> The modifications consist of replacing a slice of text at the end of a
>> line with another bit of text. The biggest slice is 10 characters, and
>> the replacement slice is always smaller than the original slice. An
>> occasional line of text is about 6000 characters long, but most are
>> about 700 haracters.  Get_Line reads them into a String variable that
>> is 10,000 characters long.
>> 
>> The problem is that the input file has more than 10 million lines of
>> text in it. The program works perfectly, but takes about 5 hours to
>> run. The Cygwin version of wc can count the lines in the input file in
>> less than one minute.
>> 
>> Why is this so slow? 
>> Do I have an Ada problem, a GNAT problem, or a MinGW problem?
>
>It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, 
>but I'd be surprised if it made that much difference.
>
>This sounds as if it would be a pretty simple program; if so, you could 
>post it (or a reasonable facsimile, if there's a reason you can't post 
>the actual code) here, and we would be better able to help.

I'm posting to cla from home. The program is at work. I'll post the
code when I get back to work.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-29 21:46   ` Fionn Mac Cumhaill
@ 2007-05-01 14:10     ` Fionn Mac Cumhaill
  2007-05-06 21:55     ` Quarc
  1 sibling, 0 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-05-01 14:10 UTC (permalink / raw)


On Sun, 29 Apr 2007 21:46:29 GMT, Fionn Mac Cumhaill
<invisible@hiding.from.spam> wrote:

>On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter"
><jrcarter@acm.org> wrote:
>
>>Fionn Mac Cumhaill wrote:
>>> 
>>> All it does is read lines in a loop from a text file with
>>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
>>> lines that it reads, and writes the lines to an output file with
>>> Put_Line. 
>>> 
>>> The modifications consist of replacing a slice of text at the end of a
>>> line with another bit of text. The biggest slice is 10 characters, and
>>> the replacement slice is always smaller than the original slice. An
>>> occasional line of text is about 6000 characters long, but most are
>>> about 700 haracters.  Get_Line reads them into a String variable that
>>> is 10,000 characters long.
>>> 
>>> The problem is that the input file has more than 10 million lines of
>>> text in it. The program works perfectly, but takes about 5 hours to
>>> run. The Cygwin version of wc can count the lines in the input file in
>>> less than one minute.
>>> 
>>> Why is this so slow? 
>>> Do I have an Ada problem, a GNAT problem, or a MinGW problem?
>>
>>It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, 
>>but I'd be surprised if it made that much difference.
>>
>>This sounds as if it would be a pretty simple program; if so, you could 
>>post it (or a reasonable facsimile, if there's a reason you can't post 
>>the actual code) here, and we would be better able to help.
>
>I'm posting to cla from home. The program is at work. I'll post the
>code when I get back to work.

Well, I was too pressed for time to do any one-change-at-a-time
experimentation; I removed the buffer initialization, and changed  the
Index searches to use Backward. I removed some other uses of Index, as
the slice for which I was looking was at a fixed location. Run time
fell to 8 minutes.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-29 21:46   ` Fionn Mac Cumhaill
  2007-05-01 14:10     ` Fionn Mac Cumhaill
@ 2007-05-06 21:55     ` Quarc
  1 sibling, 0 replies; 45+ messages in thread
From: Quarc @ 2007-05-06 21:55 UTC (permalink / raw)

Fionn Mac Cumhaill wrote:
> On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter"
> <jrcarter@acm.org> wrote:
> 
>> Fionn Mac Cumhaill wrote:
>>> All it does is read lines in a loop from a text file with
>>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
>>> lines that it reads, and writes the lines to an output file with
>>> Put_Line. 
>>>
>>> The modifications consist of replacing a slice of text at the end of a
>>> line with another bit of text. The biggest slice is 10 characters, and
>>> the replacement slice is always smaller than the original slice. An
>>> occasional line of text is about 6000 characters long, but most are
>>> about 700 haracters.  Get_Line reads them into a String variable that
>>> is 10,000 characters long.
>>>
>>> The problem is that the input file has more than 10 million lines of
>>> text in it. The program works perfectly, but takes about 5 hours to
>>> run. The Cygwin version of wc can count the lines in the input file in
>>> less than one minute.
>>>
>>> Why is this so slow? 
>>> Do I have an Ada problem, a GNAT problem, or a MinGW problem?

Sorry, but I don't have your full question here so trying to figure out 
your problem for the quote above. I woul assume the big difference 
between what you are doing and what WC does is that wc only reads the 
file, and can therefore do it in one go.

I am speculation now, but I think one explanation for this would be that 
you are reading, then writing each line by itself. Depending on the OS 
you are running on, and also the binding to the OS this could 
potentially create a loot of seek times on the disk. This could explain 
the long times I think. One of two things happen I think (or potentially 
both). 1) You read 700 bytes from disk, you write 700 bytes to disk. 
Assume average seek times of 5 ms and you get to 2*50 000 s = 100 000 s 
= 28 hours !!!! 2) Instead you creating the seektimes on the disk 
between reads, ther are other processes waiting for disk access in the 
OS as well. SO qite often some other process get to the top of the 
waiting queue, and can access disk, causing the seektimes.

If one of the above is causing your problem, the only solution is to 
read in larger chunks at once, and then write larger chunks of data as 
well. I don't remember how to set that up, and it is of course depending 
on compiler and OS, but hopefully you should be able to make sure you 
don't read line by line from disk.

regards

Peter Atterfjï¿½ll
(haven't been working professionally with Ada for 14 years now :-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
                   ` (6 preceding siblings ...)
  2007-04-28 19:12 ` Jeffrey R. Carter
@ 2007-05-02  7:46 ` george
  2007-05-03  6:31   ` Fionn Mac Cumhaill
  7 siblings, 1 reply; 45+ messages in thread
From: george @ 2007-05-02  7:46 UTC (permalink / raw)


Fionn Mac Cumhaill wrote:
> All it does is read lines in a loop from a text file with
> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
> lines that it reads, and writes the lines to an output file with
> Put_Line.
Looks like a perfect task for sed utility to me. Even if sed does not
do all you need you can at the very least check some simple sed
transformation on your data, which should give you an estimate of what
run times you can expect with your file.
I am not familiar with MinGW environment. If it is anything like
CygWin then you should be able to run GNU tools and perform that test
with sed. At least this should answer the

> Do I have an Ada problem, a GNAT problem, or a MinGW problem?

(and CygWin does tend to be slower than native computation, however I
used it only a few times).


George




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-02  7:46 ` george
@ 2007-05-03  6:31   ` Fionn Mac Cumhaill
  2007-05-03 20:00     ` Simon Wright
  2007-05-03 20:27     ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan
  0 siblings, 2 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-05-03  6:31 UTC (permalink / raw)

On 2 May 2007 00:46:07 -0700, george@gentoo.org wrote:

>Fionn Mac Cumhaill wrote:
>> All it does is read lines in a loop from a text file with
>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the
>> lines that it reads, and writes the lines to an output file with
>> Put_Line.
>Looks like a perfect task for sed utility to me. Even if sed does not
>do all you need you can at the very least check some simple sed
>transformation on your data, which should give you an estimate of what
>run times you can expect with your file.
>I am not familiar with MinGW environment. If it is anything like
>CygWin then you should be able to run GNU tools and perform that test
>with sed. At least this should answer the
>
>> Do I have an Ada problem, a GNAT problem, or a MinGW problem?
>
>(and CygWin does tend to be slower than native computation, however I
>used it only a few times).
>
>
>George
I considered sed, but I know next to nothing about it, and decided
that by the time I learned enough to use it, I could have done the job
in Ada several times over.

I did do further investigation; I made a copy of the now-working
program and threw most of the program away, leaving only a very simple
program which read the large input file, but made no changes and did
no output. I added code to track the run time and put the buffer clear
back in. It read the 10 million lines in just a little over five
minutes. I then put  Index back and used it to search the buffer for a
short string that would never be found, seatching forwards from the
beginning of the input buffer. Bingo. Run time increased to a bit more
than 1-1/2 hours. 

One of the earlier commenters observed that there was a bug in GNAT's
Index. It will be interesting to try this again once the bug is fixed.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03  6:31   ` Fionn Mac Cumhaill
@ 2007-05-03 20:00     ` Simon Wright
  2007-05-04  4:35       ` Jeffrey R. Carter
                         ` (2 more replies)
  2007-05-03 20:27     ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan
  1 sibling, 3 replies; 45+ messages in thread
From: Simon Wright @ 2007-05-03 20:00 UTC (permalink / raw)


Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:

> One of the earlier commenters observed that there was a bug in
> GNAT's Index. It will be interesting to try this again once the bug
> is fixed.

That was me. I could give you my version of Index if you want ..



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03 20:00     ` Simon Wright
@ 2007-05-04  4:35       ` Jeffrey R. Carter
  2007-05-04  4:45       ` Fionn Mac Cumhaill
  2007-05-04  6:53       ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen
  2 siblings, 0 replies; 45+ messages in thread
From: Jeffrey R. Carter @ 2007-05-04  4:35 UTC (permalink / raw)


Simon Wright wrote:
> 
> That was me. I could give you my version of Index if you want ..

Or he could try PragmARC.Quick_Searcher. The algorithm it implements is 
supposed to be faster than Boyer-Moore "for most reasonable search 
patterns".

-- 
Jeff Carter
"Mr. President, we must not allow a mine-shaft gap!"
Dr. Strangelove
33



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03 20:00     ` Simon Wright
  2007-05-04  4:35       ` Jeffrey R. Carter
@ 2007-05-04  4:45       ` Fionn Mac Cumhaill
  2007-05-04  6:53       ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen
  2 siblings, 0 replies; 45+ messages in thread
From: Fionn Mac Cumhaill @ 2007-05-04  4:45 UTC (permalink / raw)


On Thu, 03 May 2007 21:00:25 +0100, Simon Wright
<simon.j.wright@mac.com> wrote:

>Fionn Mac Cumhaill <invisible@hiding.from.spam> writes:
>
>> One of the earlier commenters observed that there was a bug in
>> GNAT's Index. It will be interesting to try this again once the bug
>> is fixed.
>
>That was me. I could give you my version of Index if you want ..
I would indeed like to see it.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP)
  2007-05-03 20:00     ` Simon Wright
  2007-05-04  4:35       ` Jeffrey R. Carter
  2007-05-04  4:45       ` Fionn Mac Cumhaill
@ 2007-05-04  6:53       ` Jacob Sparre Andersen
  2007-05-04  7:41         ` Dmitry A. Kazakov
  2 siblings, 1 reply; 45+ messages in thread
From: Jacob Sparre Andersen @ 2007-05-04  6:53 UTC (permalink / raw)


Simon Wright <simon.j.wright@mac.com> wrote:

> That was me. I could give you my version of Index if you want ..

I would like to have a copy as well.

My demonstration of how you can make a fast implementation of the Unix
command "tail" with POSIX.Memory_Mapping was not very successful, and
I suspect that the sinner may be Index.

Greetings,

Jacob
-- 
Ada - the programming language still ahead of its time.

-- tail1.adb - show the last ten lines from a named text file

with Ada.Characters.Latin_1;
with Ada.Command_Line;
with Ada.Strings.Fixed;
with Ada.Text_IO;
with POSIX;
with POSIX.IO;
with POSIX.Memory_Mapping;
with System;
with System.Storage_Elements;

procedure Tail is
   use POSIX;
   use POSIX.IO;
   use POSIX.Memory_Mapping;
   use System.Storage_Elements;

   Text_File    : File_Descriptor;
   Text_Size    : System.Storage_Elements.Storage_Offset;
   Text_Address : System.Address;
begin
   Text_File := Open (Name => To_POSIX_String (Ada.Command_Line.Argument (1)),
                      Mode => Read_Only);
   Text_Size := Storage_Offset (File_Size (Text_File));
   Text_Address := Map_Memory (Length     => Text_Size,
                               Protection => Allow_Read,
                               Mapping    => Map_Shared,
                               File       => Text_File,
                               Offset     => 0);

   declare
      use Ada.Strings;
      use Ada.Strings.Fixed;
      package Latin_1 renames Ada.Characters.Latin_1;
      Bit_Count       : constant Natural :=
                          Natural (Text_Size) * Storage_Element'Size;
      Character_Count : constant Natural :=
                          Bit_Count / Character'Size;

      Text : String (1 .. Character_Count);
      for Text'Address use Text_Address;
      Last : Natural := Text'Last + 1;
   begin
      for Lines in 1 .. 10 loop
         Last := Index (Source  => Text (Text'First .. Last - 1),
                        Pattern => (1 => Latin_1.LF),
                        Going   => Backward);
         exit when Last = 0;
      end loop;

      declare
         POSIX_Text : POSIX_String (Text'Range);
         for POSIX_Text'Address use Text'Address;
      begin
         while Last < Text'Last loop
            NONSTANDARD_Write
              (File   => Standard_Output,
               Buffer => POSIX_Text (Last + 1 .. Text'Last),
               Last   => Last);
         end loop;
      end;
   end;

   Unmap_Memory (First  => Text_Address,
                 Length => Text_Size);
   Close (File => Text_File);
end Tail;



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP)
  2007-05-04  6:53       ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen
@ 2007-05-04  7:41         ` Dmitry A. Kazakov
  2007-05-04  9:16           ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen
  0 siblings, 1 reply; 45+ messages in thread
From: Dmitry A. Kazakov @ 2007-05-04  7:41 UTC (permalink / raw)

On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:

> My demonstration of how you can make a fast implementation of the Unix
> command "tail" with POSIX.Memory_Mapping was not very successful, and
> I suspect that the sinner may be Index.

[...]

>      for Lines in 1 .. 10 loop
>         Last := Index (Source  => Text (Text'First .. Last - 1),

Did you check if GNAT did not copy the string slice before calling to
Index?

>                        Pattern => (1 => Latin_1.LF),
>                        Going   => Backward);
>         exit when Last = 0;
>      end loop;

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Copying string slices before calling subroutines? (Was: Alternative Index implementation?)
  2007-05-04  7:41         ` Dmitry A. Kazakov
@ 2007-05-04  9:16           ` Jacob Sparre Andersen
  2007-05-04  9:44             ` Copying string slices before calling subroutines? Jacob Sparre Andersen
  0 siblings, 1 reply; 45+ messages in thread
From: Jacob Sparre Andersen @ 2007-05-04  9:16 UTC (permalink / raw)


Dmitry A. Kazakov wrote:
> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:

>>      for Lines in 1 .. 10 loop
>>         Last := Index (Source  => Text (Text'First .. Last - 1),
>
> Did you check if GNAT did not copy the string slice before calling to
> Index?

No.  Do you have any suggestion for a way to check it.  The reference
manual (section 6.2) seems to indicate that it is up to the compiler
to decide.

Greetings,

Jacob
-- 
Beware of people with Gnus, they may Hurd you.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04  9:16           ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen
@ 2007-05-04  9:44             ` Jacob Sparre Andersen
  2007-05-04 10:14               ` Dmitry A. Kazakov
  0 siblings, 1 reply; 45+ messages in thread
From: Jacob Sparre Andersen @ 2007-05-04  9:44 UTC (permalink / raw)

Jacob Sparre Andersen wrote:
> Dmitry A. Kazakov wrote:
>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:

>>>      for Lines in 1 .. 10 loop
>>>         Last := Index (Source  => Text (Text'First .. Last - 1),
>>
>> Did you check if GNAT did not copy the string slice before calling to
>> Index?
>
> No.  Do you have any suggestion for a way to check it.  The
> reference manual (section 6.2) seems to indicate that it is up to
> the compiler to decide.

I have found a strong indication that GNAT copies the whole string
slice onto the stack before calling Index:

It is limited how large files I can run my Tail program on, before I
get a segmentation fault.  The limit is roughly the allowed stack size
on the system (and changes as I change the allowed stack size).

This version appears to be as fast as the GNU Tail program.  The only
thing which worries me a bit is that I see roughly twice as many
"minor pagefaults" with my own version as with the GNU version.

Greetings,

Jacob

PS: I don't feel like cheating and looking at the GNU source code yet.
-- 
"Then, after a second or so, nothing continued to happen."

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04  9:44             ` Copying string slices before calling subroutines? Jacob Sparre Andersen
@ 2007-05-04 10:14               ` Dmitry A. Kazakov
  2007-05-04 12:07                 ` Jeffrey Creem
  0 siblings, 1 reply; 45+ messages in thread
From: Dmitry A. Kazakov @ 2007-05-04 10:14 UTC (permalink / raw)


On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote:

> Jacob Sparre Andersen wrote:
>> Dmitry A. Kazakov wrote:
>>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:
> 
>>>>      for Lines in 1 .. 10 loop
>>>>         Last := Index (Source  => Text (Text'First .. Last - 1),
>>>
>>> Did you check if GNAT did not copy the string slice before calling to
>>> Index?
>>
>> No.  Do you have any suggestion for a way to check it.  The
>> reference manual (section 6.2) seems to indicate that it is up to
>> the compiler to decide.
> 
> I have found a strong indication that GNAT copies the whole string
> slice onto the stack before calling Index:

So, Index was innocent.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04 10:14               ` Dmitry A. Kazakov
@ 2007-05-04 12:07                 ` Jeffrey Creem
  2007-05-04 12:46                   ` Dmitry A. Kazakov
  2007-05-04 22:27                   ` Simon Wright
  0 siblings, 2 replies; 45+ messages in thread
From: Jeffrey Creem @ 2007-05-04 12:07 UTC (permalink / raw)


Dmitry A. Kazakov wrote:
> On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote:
> 
>> Jacob Sparre Andersen wrote:
>>> Dmitry A. Kazakov wrote:
>>>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:
>>>>>      for Lines in 1 .. 10 loop
>>>>>         Last := Index (Source  => Text (Text'First .. Last - 1),
>>>> Did you check if GNAT did not copy the string slice before calling to
>>>> Index?
>>> No.  Do you have any suggestion for a way to check it.  The
>>> reference manual (section 6.2) seems to indicate that it is up to
>>> the compiler to decide.
>> I have found a strong indication that GNAT copies the whole string
>> slice onto the stack before calling Index:
> 
> So, Index was innocent.
> 

I don't know that I'd say that...Though it might be innocent in this 
case, a quick look at the source code (which then remaps index to a 
common string search package) looks to me like there are other 
opportunities where some versions of the Index call will make another 
copy in cases when it need not and further this does not appear to match 
the comments in the package.

Specifically, in the trunk of the current FSF Ada in a-strsea.adb at the 
top of the package it says (formatting adjusted on this comment)

--  Note: This code is derived from the ADAR.CSH public domain Ada 83
--  versions of the Appendix C string handling packages (code extracted
--  from Ada.Strings.Fixed). A significant change is that we optimize
--  the case of identity mappings for Count and Index, and also
--  Index_Non_Blank is specialized (rather than using the
--   general Index routine).


Great! Sounds promising...Lets take a look at one of them


  function Index
      (Source  : String;
       Pattern : String;
       Going   : Direction := Forward;
       Mapping : Maps.Character_Mapping := Maps.Identity) return Natural
    is
       Cur_Index     : Natural;
       Mapped_Source : String (Source'Range);

    begin
       if Pattern = "" then
          raise Pattern_Error;
       end if;

       for J in Source'Range loop
          Mapped_Source (J) := Value (Mapping, Source (J));
       end loop;

       --  Forwards case

       if Going = Forward then
          for J in 1 .. Source'Length - Pattern'Length + 1 loop
             Cur_Index := Source'First + J - 1;

             if Pattern = Mapped_Source
                            (Cur_Index .. Cur_Index + Pattern'Length - 1)
             then
                return Cur_Index;
             end if;
          end loop;

       --  Backwards case

       else
          for J in reverse 1 .. Source'Length - Pattern'Length + 1 loop
             Cur_Index := Source'First + J - 1;

             if Pattern = Mapped_Source
                            (Cur_Index .. Cur_Index + Pattern'Length - 1)
             then
                return Cur_Index;
             end if;
          end loop;
       end if;

       --  Fall through if no match found. Note that the loops are skipped
       --  completely in the case of the pattern being longer than the 
source.

       return 0;
    end Index;



Hmm...I don't see the optimization for identity mappings anyplace...



You can probably stop reading here as the formatting is ugly but the 
license for the above code snippet is:


--          Copyright (C) 1992-2005, Free Software Foundation, Inc. 
     --
-- 
     --
-- GNAT is free software;  you can  redistribute it  and/or modify it 
under --
-- terms of the  GNU General Public License as published  by the Free 
Soft- --
-- ware  Foundation;  either version 2,  or (at your option) any later 
ver- --
-- sion.  GNAT is distributed in the hope that it will be useful, but 
WITH- --
-- OUT ANY WARRANTY;  without even the  implied warranty of 
MERCHANTABILITY --
-- or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public 
License --
-- for  more details.  You should have  received  a copy of the GNU 
General --
-- Public License  distributed with GNAT;  see file COPYING.  If not, 
write --
-- to  the  Free Software Foundation,  51  Franklin  Street,  Fifth 
Floor, --
-- Boston, MA 02110-1301, USA. 
     --
-- 
     --
-- As a special exception,  if other files  instantiate  generics from 
this --
-- unit, or you link  this unit with other files  to produce an 
executable, --
-- this  unit  does not  by itself cause  the resulting  executable  to 
  be --
-- covered  by the  GNU  General  Public  License.  This exception does 
not --
-- however invalidate  any other reasons why  the executable file  might 
be --
-- covered by the  GNU Public License. 
     --
-- 
     --
-- GNAT was originally developed  by the GNAT team at  New York 
University. --
-- Extensive contributions were provided by Ada Core Technologies Inc. 
     --



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04 12:07                 ` Jeffrey Creem
@ 2007-05-04 12:46                   ` Dmitry A. Kazakov
  2007-05-04 22:27                   ` Simon Wright
  1 sibling, 0 replies; 45+ messages in thread
From: Dmitry A. Kazakov @ 2007-05-04 12:46 UTC (permalink / raw)


On Fri, 04 May 2007 08:07:59 -0400, Jeffrey Creem wrote:

> Dmitry A. Kazakov wrote:
>> On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote:
>> 
>>> Jacob Sparre Andersen wrote:
>>>> Dmitry A. Kazakov wrote:
>>>>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote:
>>>>>>      for Lines in 1 .. 10 loop
>>>>>>         Last := Index (Source  => Text (Text'First .. Last - 1),
>>>>> Did you check if GNAT did not copy the string slice before calling to
>>>>> Index?
>>>> No.  Do you have any suggestion for a way to check it.  The
>>>> reference manual (section 6.2) seems to indicate that it is up to
>>>> the compiler to decide.
>>> I have found a strong indication that GNAT copies the whole string
>>> slice onto the stack before calling Index:
>> 
>> So, Index was innocent.
> 
> I don't know that I'd say that... Though it might be innocent in this 
> case, a quick look at the source code (which then remaps index to a 
> common string search package) looks to me like there are other 
> opportunities where some versions of the Index call will make another 
> copy in cases when it need not and further this does not appear to match 
> the comments in the package.

Devastating. I take my words back.
 
> Hmm...I don't see the optimization for identity mappings anyplace...

It looks that they call "optimization" copying Source into Mapped_Source
with remapping. Probably they hoped that comparing Pattern with a remapping
slice could translate into a processor-specific string comparison
instruction. Anyway...

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04 12:07                 ` Jeffrey Creem
  2007-05-04 12:46                   ` Dmitry A. Kazakov
@ 2007-05-04 22:27                   ` Simon Wright
  2007-05-05  7:33                     ` Jacob Sparre Andersen
  2007-05-05  7:41                     ` Dmitry A. Kazakov
  1 sibling, 2 replies; 45+ messages in thread
From: Simon Wright @ 2007-05-04 22:27 UTC (permalink / raw)


Jeffrey Creem <jeff@thecreems.com> writes:

>  function Index
>      (Source  : String;
>       Pattern : String;
>       Going   : Direction := Forward;
>       Mapping : Maps.Character_Mapping := Maps.Identity) return Natural
>    is
>       Cur_Index     : Natural;
>       Mapped_Source : String (Source'Range);

There's the problem that I had (in my case Source'Length was >
2_000_000!) calling Index from one of my tasks.

My reworking begins

   function Index
     (Source  : String;
      Pattern : String;
      Going   : Ada.Strings.Direction := Ada.Strings.Forward;
      Mapping : Ada.Strings.Maps.Character_Mapping
        := Ada.Strings.Maps.Identity) return Natural
   is
      Cur_Index       : Natural;
      Potential_Match : Boolean;
      use Ada.Strings;
      use Ada.Strings.Maps;
   begin
      if Pattern = "" then
         raise Pattern_Error;
      end if;

      --  Forwards case

      if Going = Forward then
         for J in 1 .. Source'Length - Pattern'Length + 1 loop
            Cur_Index := Source'First + J - 1;
            Potential_Match := True;
            for K in Pattern'Range loop
               if Pattern (K) /=
                 Value (Mapping, Source (Cur_Index + K - 1)) then
                  Potential_Match := False;
                  exit;
               end if;
            end loop;
            if Potential_Match then
               return Cur_Index;
            end if;
         end loop;

which calls Ada.Strings.Maps.Value rather more often than I suppose it
could. But since the normal case is for the mapping to be the identity
mapping the most useful optimisation would be to check for Identity
and just compare directly.


Someone else was asking whether GNAT copies on the stack before the
call -- I see no evidence of that. I had to provide my own
implementation of this Ada05 routine because my code has to be Ada95,
and it shows no sign of excessive stack use:

   function Index
     (Source : String;
      Pattern : String;
      From : Positive;
      Going : Ada.Strings.Direction := Ada.Strings.Forward;
      Mapping : Ada.Strings.Maps.Character_Mapping
        := Ada.Strings.Maps.Identity)
     return Natural is
      Candidate : String renames Source (From .. Source'Last);
   begin
      return Index (Source => Candidate,
                    Pattern => Pattern,
                    Going => Going,
                    Mapping => Mapping);
   end Index;



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04 22:27                   ` Simon Wright
@ 2007-05-05  7:33                     ` Jacob Sparre Andersen
  2007-05-05  7:47                       ` Dmitry A. Kazakov
  2007-05-05  7:41                     ` Dmitry A. Kazakov
  1 sibling, 1 reply; 45+ messages in thread
From: Jacob Sparre Andersen @ 2007-05-05  7:33 UTC (permalink / raw)


Simon Wright <simon.j.wright@mac.com> writes:
> Jeffrey Creem <jeff@thecreems.com> writes:

>>  function Index
>>      (Source  : String;
>>       Pattern : String;
>>       Going   : Direction := Forward;
>>       Mapping : Maps.Character_Mapping := Maps.Identity) return Natural
>>    is
>>       Cur_Index     : Natural;
>>       Mapped_Source : String (Source'Range);
>
> There's the problem that I had (in my case Source'Length was >
> 2_000_000!) calling Index from one of my tasks.
>
> My reworking begins
[...]
> which calls Ada.Strings.Maps.Value rather more often than I suppose
> it could. But since the normal case is for the mapping to be the
> identity mapping the most useful optimisation would be to check for
> Identity and just compare directly.

Wouldn't the proper solution be to separate out Mapping =
Maps.Identity and Mapping /= Maps.Identity?  Even though we can do
cool things with default values for arguments, it isn't always
efficient.

   function Index
     (Source  : String;
      Pattern : String;
      Going   : Direction := Forward) return Natural;

   function Index
     (Source  : String;
      Pattern : String;
      Going   : Direction := Forward;
      Mapping : Maps.Character_Mapping) return Natural;

I can't see that there can be any cases, where existing code will work
differently with these two specifications instead of the existing
single specification.

We should also remember that for a compiler writer team, Mapping =
Maps.Identity may not be the normal case.

> Someone else was asking whether GNAT copies on the stack before the
> call -- I see no evidence of that.

Now that the implementation of Index has been pointed out to me, I
agree.  I could just see that it happened at some point during the
call to Index.

Greetings,

Jacob
-- 
ï¿½For there are only two reasons why war is made against a
 republic: The one, to become lord over her: the other, the
 fear of being occupied by her.ï¿½       -- Nicolo Machiavelli



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-05  7:33                     ` Jacob Sparre Andersen
@ 2007-05-05  7:47                       ` Dmitry A. Kazakov
  0 siblings, 0 replies; 45+ messages in thread
From: Dmitry A. Kazakov @ 2007-05-05  7:47 UTC (permalink / raw)


On Sat, 05 May 2007 09:33:26 +0200, Jacob Sparre Andersen wrote:

> We should also remember that for a compiler writer team, Mapping =
> Maps.Identity may not be the normal case.

An alternative could be to change A.4.3. towards two Index functions
instead of one:

function Index (Source : in String; 
Pattern : in String; 
Going : in Direction := Forward; 
Mapping : in Maps.Character_Mapping) 
return Natural; 
 
function Index (Source : in String; 
Pattern : in String; 
Going : in Direction := Forward)
return Natural; 

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Copying string slices before calling subroutines?
  2007-05-04 22:27                   ` Simon Wright
  2007-05-05  7:33                     ` Jacob Sparre Andersen
@ 2007-05-05  7:41                     ` Dmitry A. Kazakov
  1 sibling, 0 replies; 45+ messages in thread
From: Dmitry A. Kazakov @ 2007-05-05  7:41 UTC (permalink / raw)


On Fri, 04 May 2007 23:27:57 +0100, Simon Wright wrote:

> My reworking begins
> 
>    function Index
>      (Source  : String;
>       Pattern : String;
>       Going   : Ada.Strings.Direction := Ada.Strings.Forward;
>       Mapping : Ada.Strings.Maps.Character_Mapping
>         := Ada.Strings.Maps.Identity) return Natural
>    is
>       Cur_Index       : Natural;
>       Potential_Match : Boolean;
>       use Ada.Strings;
>       use Ada.Strings.Maps;
>    begin
>       if Pattern = "" then
>          raise Pattern_Error;
>       end if;
> 
>       --  Forwards case
> 
>       if Going = Forward then
>          for J in 1 .. Source'Length - Pattern'Length + 1 loop
>             Cur_Index := Source'First + J - 1;
>             Potential_Match := True;
>             for K in Pattern'Range loop
>                if Pattern (K) /=
>                  Value (Mapping, Source (Cur_Index + K - 1)) then
>                   Potential_Match := False;
>                   exit;
>                end if;
>             end loop;
>             if Potential_Match then
>                return Cur_Index;
>             end if;
>          end loop;
> 
> which calls Ada.Strings.Maps.Value rather more often than I suppose it
> could.

You can save remapping of Source, just there is no need to do it in advance
and allocate a full copy of Source. Instead of that, you make a ring buffer
of mod Pattern'Length where you store Value (Mapping, Source (i)). Once you
advance the main string index you "rotate" the buffer.

Then Pattern'Length = 1 should be a special case, as well as identity
mapping. Though, I don't know how efficient the latter could be.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03  6:31   ` Fionn Mac Cumhaill
  2007-05-03 20:00     ` Simon Wright
@ 2007-05-03 20:27     ` Adam Beneschan
  2007-05-03 23:01       ` Randy Brukardt
  1 sibling, 1 reply; 45+ messages in thread
From: Adam Beneschan @ 2007-05-03 20:27 UTC (permalink / raw)

On May 2, 11:31 pm, Fionn Mac Cumhaill <invisi...@hiding.from.spam>
wrote:

> I did do further investigation; I made a copy of the now-working
> program and threw most of the program away, leaving only a very simple
> program which read the large input file, but made no changes and did
> no output. I added code to track the run time and put the buffer clear
> back in. It read the 10 million lines in just a little over five
> minutes. I then put  Index back and used it to search the buffer for a
> short string that would never be found, seatching forwards from the
> beginning of the input buffer. Bingo. Run time increased to a bit more
> than 1-1/2 hours.

It strikes me that Index is the kind of function that really ought to
be written in assembly language, at least partially.  I notice that
the version of Linux that I'm using has a built-in function to search
memory for a substring; this is very descriptively called memmem() and
has the amusing profile

   void *memmem (const void *needle, size_t needlelen,
                            const void *haystack, size_t haystacklen);

according to the man page.  But I assume this is written to use
registers optimally and take advantage of the REP instructions (on an
x86 or Pentium).  I don't know how GNAT implements Index---I haven't
looked into it.

                                    -- Adam

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03 20:27     ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan
@ 2007-05-03 23:01       ` Randy Brukardt
  2007-05-04  0:28         ` Markus E Leypold
  2007-05-04 20:04         ` Adam Beneschan
  0 siblings, 2 replies; 45+ messages in thread
From: Randy Brukardt @ 2007-05-03 23:01 UTC (permalink / raw)


"Adam Beneschan" <adam@irvine.com> wrote in message
news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
...
> It strikes me that Index is the kind of function that really ought to
> be written in assembly language, at least partially.  I notice that
> the version of Linux that I'm using has a built-in function to search
> memory for a substring; this is very descriptively called memmem() and
> has the amusing profile
>
>    void *memmem (const void *needle, size_t needlelen,
>                             const void *haystack, size_t haystacklen);
>
> according to the man page.  But I assume this is written to use
> registers optimally and take advantage of the REP instructions (on an
> x86 or Pentium).  I don't know how GNAT implements Index---I haven't
> looked into it.

The big expense in Index is the mapping set or function, not the actual
compare. For Janus/Ada, I had seen a similar problem (a big deal as Index
was used to look for spam patterns), and finally special-cased a number of
common cases (no mapping, single character patterns, and so on). I also
spent a bit of time on the code generator, figuring that this sort of string
manipulation code is common enough that it might as well be generated well.
The updates helped a lot, although they don't quite generate a single
instruction such as is possible. (OTOH, Intel used to recommend avoiding the
block move and compare instructions because they fouled up the pipeline and
thus slowed the overall execution. I don't know if that is still true, but
ifi it is, there might be less benefit to hand-coded assembler than you are
thinking...)

                               Randy.





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03 23:01       ` Randy Brukardt
@ 2007-05-04  0:28         ` Markus E Leypold
  2007-05-05 16:26           ` Adam Beneschan
  2007-05-04 20:04         ` Adam Beneschan
  1 sibling, 1 reply; 45+ messages in thread
From: Markus E Leypold @ 2007-05-04  0:28 UTC (permalink / raw)




"Randy Brukardt" <randy@rrsoftware.com> writes:

> "Adam Beneschan" <adam@irvine.com> wrote in message
> news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
> ...
>> It strikes me that Index is the kind of function that really ought to
>> be written in assembly language, at least partially.  I notice that
>> the version of Linux that I'm using has a built-in function to search
>> memory for a substring; this is very descriptively called memmem() and
>> has the amusing profile
>>
>>    void *memmem (const void *needle, size_t needlelen,
>>                             const void *haystack, size_t haystacklen);
>>
>> according to the man page.  But I assume this is written to use
>> registers optimally and take advantage of the REP instructions (on an
>> x86 or Pentium).  I don't know how GNAT implements Index---I haven't
>> looked into it.
>
> The big expense in Index is the mapping set or function, not the actual
> compare. For Janus/Ada, I had seen a similar problem (a big deal as Index
> was used to look for spam patterns), and finally special-cased a number of
> common cases (no mapping, single character patterns, and so on). I also
> spent a bit of time on the code generator, figuring that this sort of string
> manipulation code is common enough that it might as well be generated well.
> The updates helped a lot, although they don't quite generate a single
> instruction such as is possible. (OTOH, Intel used to recommend avoiding the
> block move and compare instructions because they fouled up the pipeline and
> thus slowed the overall execution. I don't know if that is still true, but
> ifi it is, there might be less benefit to hand-coded assembler than you are
> thinking...)


I'd like to add, that some years ago I read a rather well written
paper about this kind of text searches. Just repeating comparisons in
two loops (one to move the offset in haystack, the other to compare
need against haystack+offset) is actually not the best method. I.e. if
one fails when comparing this

     haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd
     needle:                                 adasdqwsdfklsdf
                                                    ^
                                                   comparison fails

one can immediately increase the offset by 7, since the needle does
not contain a '3' in the part that is before the point where the
comparison failed. The technique I read about, used considerations of
this kind to compile the 'needle' into some automaton which could skip
more than 1 character at a time in haystack and then executed /
interpreted this automaton. (Actually the algorithm was a bit more
tricky than this, I think: If the automaton sees a character that is
not in needle at all, it can immediately skip for the length of needle
and I think part of the trick was to start comparing from the end of
needle).

So in this case a better algorithm is probably the way to go (and I
agree, asembler and REP, which would just be the naive algorithm with
2 loops, won't cut it, at least I'd not bet on it)

Regards -- Markus






^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-04  0:28         ` Markus E Leypold
@ 2007-05-05 16:26           ` Adam Beneschan
  2007-05-05 17:27             ` Markus E Leypold
  0 siblings, 1 reply; 45+ messages in thread
From: Adam Beneschan @ 2007-05-05 16:26 UTC (permalink / raw)

On May 3, 5:28 pm, Markus E Leypold
<development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote:
> "Randy Brukardt" <r...@rrsoftware.com> writes:
> > "Adam Beneschan" <a...@irvine.com> wrote in message
> >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
> > ...
> >> It strikes me that Index is the kind of function that really ought to
> >> be written in assembly language, at least partially.  I notice that
> >> the version of Linux that I'm using has a built-in function to search
> >> memory for a substring; this is very descriptively called memmem() and
> >> has the amusing profile
>
> >>    void *memmem (const void *needle, size_t needlelen,
> >>                             const void *haystack, size_t haystacklen);
>
> >> according to the man page.  But I assume this is written to use
> >> registers optimally and take advantage of the REP instructions (on an
> >> x86 or Pentium).  I don't know how GNAT implements Index---I haven't
> >> looked into it.
>
> > The big expense in Index is the mapping set or function, not the actual
> > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index
> > was used to look for spam patterns), and finally special-cased a number of
> > common cases (no mapping, single character patterns, and so on). I also
> > spent a bit of time on the code generator, figuring that this sort of string
> > manipulation code is common enough that it might as well be generated well.
> > The updates helped a lot, although they don't quite generate a single
> > instruction such as is possible. (OTOH, Intel used to recommend avoiding the
> > block move and compare instructions because they fouled up the pipeline and
> > thus slowed the overall execution. I don't know if that is still true, but
> > ifi it is, there might be less benefit to hand-coded assembler than you are
> > thinking...)
>
> I'd like to add, that some years ago I read a rather well written
> paper about this kind of text searches. Just repeating comparisons in
> two loops (one to move the offset in haystack, the other to compare
> need against haystack+offset) is actually not the best method. I.e. if
> one fails when comparing this
>
>      haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd
>      needle:                                 adasdqwsdfklsdf
>                                                     ^
>                                                    comparison fails
>
> one can immediately increase the offset by 7, since the needle does
> not contain a '3' in the part that is before the point where the
> comparison failed. The technique I read about, used considerations of
> this kind to compile the 'needle' into some automaton which could skip
> more than 1 character at a time in haystack and then executed /
> interpreted this automaton. (Actually the algorithm was a bit more
> tricky than this, I think: If the automaton sees a character that is
> not in needle at all, it can immediately skip for the length of needle
> and I think part of the trick was to start comparing from the end of
> needle).
>
> So in this case a better algorithm is probably the way to go (and I
> agree, asembler and REP, which would just be the naive algorithm with
> 2 loops, won't cut it, at least I'd not bet on it)

A big problem, however, is that while such "well-written" papers may
be mathematically correct, they must make assumptions that aren't
necessarily true in real life.  For instance, analyses that I've read
on searching algorithms show that this or that algorithm may minimize
the number of times character comparisons are performed.  But in real
life, not all character comparisons have the same cost.  If a
processor has a string comparison instruction, for instance, then
presumably the hardware has been optimized so that the characters will
be compared significantly faster than if the program did all the index
register manipulation itself and compared individual characters.
Algorithmic analysis, or at least the kind I've seen, usually doesn't
take this sort of thing into account.  If one pattern-matching
algorithm can be shown to be better than another, then (for a
particular processor) there should be some values M and N such that if
the source length is >M and the pattern length is >N, the
mathematically faster algorithm actually does run faster---but that
doesn't help if your real-life inputs are likely to have lengths
significantly less than that.  The original poster had a pattern with
a quite small length, for instance.  A technique in which you
precompile the pattern into an automaton, for instance, is likely to
make things slower when you're using such a small pattern and calling
the Index routine thousands or millions of times.

This doesn't necessarily help write the Index routine in the best
way.  In this case, an implementation *could* look at the lengths of
the strings and make a judgment as to what the best algorithm might be
(the implementor may have to experiment on each supported processor,
though, to determine the correct cutoffs).  But I think it would be a
mistake for an implementor to write Index in a way that uses some
"optimal" algorithm that makes things better only when the argument
sizes are >1000, say, and makes things slower when the sizes are less
than that.

                                       -- Adam

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-05 16:26           ` Adam Beneschan
@ 2007-05-05 17:27             ` Markus E Leypold
  2007-05-15 23:03               ` Randy Brukardt
  0 siblings, 1 reply; 45+ messages in thread
From: Markus E Leypold @ 2007-05-05 17:27 UTC (permalink / raw)



Adam Beneschan <adam@irvine.com> writes:

> On May 3, 5:28 pm, Markus E Leypold
> <development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote:
>> "Randy Brukardt" <r...@rrsoftware.com> writes:
>> > "Adam Beneschan" <a...@irvine.com> wrote in message
>> >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
>> > ...
>> >> It strikes me that Index is the kind of function that really ought to
>> >> be written in assembly language, at least partially.  I notice that
>> >> the version of Linux that I'm using has a built-in function to search
>> >> memory for a substring; this is very descriptively called memmem() and
>> >> has the amusing profile
>>
>> >>    void *memmem (const void *needle, size_t needlelen,
>> >>                             const void *haystack, size_t haystacklen);
>>
>> >> according to the man page.  But I assume this is written to use
>> >> registers optimally and take advantage of the REP instructions (on an
>> >> x86 or Pentium).  I don't know how GNAT implements Index---I haven't
>> >> looked into it.
>>
>> > The big expense in Index is the mapping set or function, not the actual
>> > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index
>> > was used to look for spam patterns), and finally special-cased a number of
>> > common cases (no mapping, single character patterns, and so on). I also
>> > spent a bit of time on the code generator, figuring that this sort of string
>> > manipulation code is common enough that it might as well be generated well.
>> > The updates helped a lot, although they don't quite generate a single
>> > instruction such as is possible. (OTOH, Intel used to recommend avoiding the
>> > block move and compare instructions because they fouled up the pipeline and
>> > thus slowed the overall execution. I don't know if that is still true, but
>> > ifi it is, there might be less benefit to hand-coded assembler than you are
>> > thinking...)
>>
>> I'd like to add, that some years ago I read a rather well written
>> paper about this kind of text searches. Just repeating comparisons in
>> two loops (one to move the offset in haystack, the other to compare
>> need against haystack+offset) is actually not the best method. I.e. if
>> one fails when comparing this
>>
>>      haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd
>>      needle:                                 adasdqwsdfklsdf
>>                                                     ^
>>                                                    comparison fails
>>
>> one can immediately increase the offset by 7, since the needle does
>> not contain a '3' in the part that is before the point where the
>> comparison failed. The technique I read about, used considerations of
>> this kind to compile the 'needle' into some automaton which could skip
>> more than 1 character at a time in haystack and then executed /
>> interpreted this automaton. (Actually the algorithm was a bit more
>> tricky than this, I think: If the automaton sees a character that is
>> not in needle at all, it can immediately skip for the length of needle
>> and I think part of the trick was to start comparing from the end of
>> needle).
>>
>> So in this case a better algorithm is probably the way to go (and I
>> agree, asembler and REP, which would just be the naive algorithm with
>> 2 loops, won't cut it, at least I'd not bet on it)
>
> A big problem, however, is that while such "well-written" papers may
> be mathematically correct, they must make assumptions that aren't
> necessarily true in real life.  For instance, analyses that I've read

While that is certainly true, I don't think it applies in this case,
The algorithm I'm referring to, and which I'm to lazy to look up,
seems to be rather much in use when looking for a give substring in a
larger string/text, like when searching or searching/replacing in a
text editor.


> on searching algorithms show that this or that algorithm may minimize
> the number of times character comparisons are performed.  But in real
> life, not all character comparisons have the same cost.  If a
> processor has a string comparison instruction, for instance, then
> presumably the hardware has been optimized so that the characters will
> be compared significantly faster than if the program did all the index
> register manipulation itself and compared individual characters.

The processor hardware can only search of strings in a "dumb" way,
whereas the algorithm I have been talking about can be much
smarter. Of course I'd expect algorithms like that to have the best
(and most predictable) performance on RISC machines. And it might well
pay to implement the "interpreter" in assembly too.

> Algorithmic analysis, or at least the kind I've seen, usually doesn't
> take this sort of thing into account.  If one pattern-matching

Isn't that some kind of sweeping generalization?

> algorithm can be shown to be better than another, then (for a
> particular processor) there should be some values M and N such that if
> the source length is >M and the pattern length is >N, the
> mathematically faster algorithm actually does run faster---but that
> doesn't help if your real-life inputs are likely to have lengths
> significantly less than that.  The original poster had a pattern with
> a quite small length, for instance.  A technique in which you
> precompile the pattern into an automaton, for instance, is likely to
> make things slower when you're using such a small pattern and calling
> the Index routine thousands or millions of times.

Well -- I'm not trying to peddle that algorithm to you or anybody
else. Indeed I now wish. I've stayed silent instead. 

Originally I intended to point people, who try to find substrings to
the fact that there a much smarter algorithms available which can skip
a significant percentage of character comparisons. Wether this is
faster than brute force dumb comparison would have to be decided in
any single case. Ideally though, in the case of a substring of N
characters, when a single character comparison fails, offset can be
skipped by a K in the order of magnitude of N, that is, if you look
for a substring of N characters, you might only have to do something
on the order of 1/N of the comparisons you'd have to do in the dumb
implementation. Admittedly things are more complicated than that, but
I think one can see, that the point where it becomes as fast as the
dumb case could be reached for small N, esp. if the interpreter is
handwritten assembler or generated by a really good code generator.

As I said: I'm not trying to peddle that algorithm to anybody, so I'm
not inclined to look it up or try a better analysis. I've better
things to do, since _I_ don't have problem with searching for
substrings at the moment.

> This doesn't necessarily help write the Index routine in the best
> way.  In this case, an implementation *could* look at the lengths of
> the strings and make a judgment as to what the best algorithm might be
> (the implementor may have to experiment on each supported processor,
> though, to determine the correct cutoffs).  But I think it would be a
> mistake for an implementor to write Index in a way that uses some
> "optimal" algorithm that makes things better only when the argument
> sizes are >1000, say, and makes things slower when the sizes are less
> than that.

If I understood the "paper" right (actually it was only some dumb
article in some applied programming magazine), the algorithm had been
designed for simple word searched, i.e. substrings around 10
characters.

Regards -- Markus



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-05 17:27             ` Markus E Leypold
@ 2007-05-15 23:03               ` Randy Brukardt
  0 siblings, 0 replies; 45+ messages in thread
From: Randy Brukardt @ 2007-05-15 23:03 UTC (permalink / raw)


(I've been on vacation and just am now catching up...)

"Markus E Leypold"
<development-2006-8ecbb5cc8aREMOVETHIS@ANDTHATm-e-leypold.de> wrote in
message news:9qhcqr5gfi.fsf@hod.lan.m-e-leypold.de...
> Adam Beneschan <adam@irvine.com> writes:
>
> > On May 3, 5:28 pm, Markus E Leypold
> > <development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote:
> >> "Randy Brukardt" <r...@rrsoftware.com> writes:
> >> > "Adam Beneschan" <a...@irvine.com> wrote in message
> >> >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com...
> >> > ...
> >> >> It strikes me that Index is the kind of function that really ought
to
> >> >> be written in assembly language, at least partially.  I notice that
> >> >> the version of Linux that I'm using has a built-in function to
search
> >> >> memory for a substring; this is very descriptively called memmem()
and
> >> >> has the amusing profile
> >>
> >> >>    void *memmem (const void *needle, size_t needlelen,
> >> >>                             const void *haystack, size_t
haystacklen);
> >>
> >> >> according to the man page.  But I assume this is written to use
> >> >> registers optimally and take advantage of the REP instructions (on
an
> >> >> x86 or Pentium).  I don't know how GNAT implements Index---I haven't
> >> >> looked into it.
> >>
> >> > The big expense in Index is the mapping set or function, not the
actual
> >> > compare. For Janus/Ada, I had seen a similar problem (a big deal as
Index
> >> > was used to look for spam patterns), and finally special-cased a
number of
> >> > common cases (no mapping, single character patterns, and so on). I
also
> >> > spent a bit of time on the code generator, figuring that this sort of
string
> >> > manipulation code is common enough that it might as well be generated
well.
> >> > The updates helped a lot, although they don't quite generate a single
> >> > instruction such as is possible. (OTOH, Intel used to recommend
avoiding the
> >> > block move and compare instructions because they fouled up the
pipeline and
> >> > thus slowed the overall execution. I don't know if that is still
true, but
> >> > ifi it is, there might be less benefit to hand-coded assembler than
you are
> >> > thinking...)
> >>
> >> I'd like to add, that some years ago I read a rather well written
> >> paper about this kind of text searches. Just repeating comparisons in
> >> two loops (one to move the offset in haystack, the other to compare
> >> need against haystack+offset) is actually not the best method. I.e. if
> >> one fails when comparing this
> >>
> >>      haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd
> >>      needle:                                 adasdqwsdfklsdf
> >>                                                     ^
> >>                                                    comparison fails
> >>
> >> one can immediately increase the offset by 7, since the needle does
> >> not contain a '3' in the part that is before the point where the
> >> comparison failed. The technique I read about, used considerations of
> >> this kind to compile the 'needle' into some automaton which could skip
> >> more than 1 character at a time in haystack and then executed /
> >> interpreted this automaton. (Actually the algorithm was a bit more
> >> tricky than this, I think: If the automaton sees a character that is
> >> not in needle at all, it can immediately skip for the length of needle
> >> and I think part of the trick was to start comparing from the end of
> >> needle).
> >>
> >> So in this case a better algorithm is probably the way to go (and I
> >> agree, asembler and REP, which would just be the naive algorithm with
> >> 2 loops, won't cut it, at least I'd not bet on it)
> >
> > A big problem, however, is that while such "well-written" papers may
> > be mathematically correct, they must make assumptions that aren't
> > necessarily true in real life.  For instance, analyses that I've read
>
> While that is certainly true, I don't think it applies in this case,
> The algorithm I'm referring to, and which I'm to lazy to look up,
> seems to be rather much in use when looking for a give substring in a
> larger string/text, like when searching or searching/replacing in a
> text editor.

No, Adam is right, it applies in every case. A brute force search is often
cheaper than a fancier one. For instance, in the Janus/Ada parser, I
implemented a fancy binary table search in order to find items faster. It
actually made things slower. The problem was that many of the table rows
only had a few items, so the setup time (which required a number of adds and
shifts) took longer than a brute-force search (which was just a simple for
loop, one add per iteration). Hand coding that loop in assembler turned it
into a handful of instructions.

Brute force is almost always going to be better if the comparisons are cheap
and you can keep everything in registers. That's often true for Index.

...
> Originally I intended to point people, who try to find substrings to
> the fact that there a much smarter algorithms available which can skip
> a significant percentage of character comparisons. Wether this is
> faster than brute force dumb comparison would have to be decided in
> any single case. Ideally though, in the case of a substring of N
> characters, when a single character comparison fails, offset can be
> skipped by a K in the order of magnitude of N, that is, if you look
> for a substring of N characters, you might only have to do something
> on the order of 1/N of the comparisons you'd have to do in the dumb
> implementation. Admittedly things are more complicated than that, but
> I think one can see, that the point where it becomes as fast as the
> dumb case could be reached for small N, esp. if the interpreter is
> handwritten assembler or generated by a really good code generator.

Here the problem is that you have to know something about N in order to
tell. My experience is that N is usually small (often 1), and in such cases
there is likely no value to a fancy algorithm. Of course, in a general
routine like Index, pretty much anything can happen, and it is highly
unlikely that the algorithm will be optimal for every possible case.

Truth is, this demonstrates a standard moral of programming: its fine to
write your programs with a standard library routine, but you need to profile
it after doing so to see where the bottlenecks are. And if you're going to
call anything (but especially some canned library routine) millions of
times, you need to be prepared to replace it with one that is optimal for
your application. In this case, for instance, you can figure out the
distribution of N for a particular program, and implement an appropriate
algorithm. You can't expect us compiler writers to do it for you; we may
have a very different distribution of N in mind. (Of course, if the routine
isn't critical, don't sweat it -- doing otherwise is just premature
optimization.)

                             Randy.





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-03 23:01       ` Randy Brukardt
  2007-05-04  0:28         ` Markus E Leypold
@ 2007-05-04 20:04         ` Adam Beneschan
  2007-05-05 16:36           ` tmoran
  1 sibling, 1 reply; 45+ messages in thread
From: Adam Beneschan @ 2007-05-04 20:04 UTC (permalink / raw)

On May 3, 4:01 pm, "Randy Brukardt" <r...@rrsoftware.com> wrote:

> The big expense in Index is the mapping set or function, not the actual
> compare.

Yeah, but I'd assume that a high proportion of uses of Index would use
the default parameter for mapping (Maps.Identity), which an
implementation should be able to test for and choose a faster
implementation in that case.  Which is what I assumed you did.

> For Janus/Ada, I had seen a similar problem (a big deal as Index
> was used to look for spam patterns), and finally special-cased a number of
> common cases (no mapping, single character patterns, and so on). I also
> spent a bit of time on the code generator, figuring that this sort of string
> manipulation code is common enough that it might as well be generated well.
> The updates helped a lot, although they don't quite generate a single
> instruction such as is possible. (OTOH, Intel used to recommend avoiding the
> block move and compare instructions because they fouled up the pipeline and
> thus slowed the overall execution. I don't know if that is still true, but
> ifi it is, there might be less benefit to hand-coded assembler than you are
> thinking...)

Sigh... I've been in the computer business too long.  I started out
with processors that execute the instruction you tell it to execute
and then go to the next one, so I still tend to think in those terms,
even though I now have to deal with processors that look ahead to
schedule instructions and try to execute instructions in parallel and
even have "branch delays" so that if you tell the computer to branch,
it will sit there and not budge until you give it five more
instructions, which seems WAY too much like dealing with my 8-year-old
son.  But yes, these days I need to be more careful about my
assumptions about what kinds of machine instructions will execute the
fastest.

                                             -- Adam

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Reading and writing a big file in Ada (GNAT) on Windows XP
  2007-05-04 20:04         ` Adam Beneschan
@ 2007-05-05 16:36           ` tmoran
  0 siblings, 0 replies; 45+ messages in thread
From: tmoran @ 2007-05-05 16:36 UTC (permalink / raw)


>Yeah, but I'd assume that a high proportion of uses of Index would use
>...
> But yes, these days I need to be more careful about my assumptions about
> what kinds of machine instructions will execute the fastest.
  Different versions of the same processor architecture (ie, compiler
target) are different, different uses of Index have different statistics,
with multicore it may be worth splitting up the job, .....
Sometimes it's worth rolling your own instead of using the library routine.



^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2007-05-15 23:03 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-28  5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill
2007-04-28  5:20 ` Gautier
2007-04-29 21:17   ` Fionn Mac Cumhaill
2007-04-28  5:25 ` tmoran
2007-04-28  6:56 ` Martin Krischik
2007-04-28 17:12   ` tmoran
2007-04-28 12:41 ` Jeffrey Creem
2007-04-29 21:35   ` Fionn Mac Cumhaill
2007-04-28 13:22 ` (see below)
2007-04-28 17:56 ` Simon Wright
2007-04-28 18:28   ` Jeffrey Creem
2007-04-29  7:20     ` Simon Wright
2007-04-29 21:44     ` Fionn Mac Cumhaill
2007-04-29 21:42   ` Fionn Mac Cumhaill
2007-04-30  0:48     ` Jeffrey R. Carter
2007-04-30  2:30       ` Fionn Mac Cumhaill
2007-04-30  4:21         ` tmoran
2007-04-28 19:12 ` Jeffrey R. Carter
2007-04-29 21:46   ` Fionn Mac Cumhaill
2007-05-01 14:10     ` Fionn Mac Cumhaill
2007-05-06 21:55     ` Quarc
2007-05-02  7:46 ` george
2007-05-03  6:31   ` Fionn Mac Cumhaill
2007-05-03 20:00     ` Simon Wright
2007-05-04  4:35       ` Jeffrey R. Carter
2007-05-04  4:45       ` Fionn Mac Cumhaill
2007-05-04  6:53       ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen
2007-05-04  7:41         ` Dmitry A. Kazakov
2007-05-04  9:16           ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen
2007-05-04  9:44             ` Copying string slices before calling subroutines? Jacob Sparre Andersen
2007-05-04 10:14               ` Dmitry A. Kazakov
2007-05-04 12:07                 ` Jeffrey Creem
2007-05-04 12:46                   ` Dmitry A. Kazakov
2007-05-04 22:27                   ` Simon Wright
2007-05-05  7:33                     ` Jacob Sparre Andersen
2007-05-05  7:47                       ` Dmitry A. Kazakov
2007-05-05  7:41                     ` Dmitry A. Kazakov
2007-05-03 20:27     ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan
2007-05-03 23:01       ` Randy Brukardt
2007-05-04  0:28         ` Markus E Leypold
2007-05-05 16:26           ` Adam Beneschan
2007-05-05 17:27             ` Markus E Leypold
2007-05-15 23:03               ` Randy Brukardt
2007-05-04 20:04         ` Adam Beneschan
2007-05-05 16:36           ` tmoran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox