* Reading and writing a big file in Ada (GNAT) on Windows XP @ 2007-04-28 5:03 Fionn Mac Cumhaill 2007-04-28 5:20 ` Gautier ` (7 more replies) 0 siblings, 8 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-28 5:03 UTC (permalink / raw) I have written a very simple program that runs on a Windows XP machine. (Pentium D , I don't remember the clock speed, it is in middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the MinGW GNAT 3.4.5 Ada compiler. All it does is read lines in a loop from a text file with Ada.Text_IO.Get_Line, does minor modifications on about 80% of the lines that it reads, and writes the lines to an output file with Put_Line. The modifications consist of replacing a slice of text at the end of a line with another bit of text. The biggest slice is 10 characters, and the replacement slice is always smaller than the original slice. An occasional line of text is about 6000 characters long, but most are about 700 haracters. Get_Line reads them into a String variable that is 10,000 characters long. The problem is that the input file has more than 10 million lines of text in it. The program works perfectly, but takes about 5 hours to run. The Cygwin version of wc can count the lines in the input file in less than one minute. The program that produces the file pulls data from a SQL Server 7 running on an 800 MhZ Pentium III machine with 512 Mb and writes it to a file in 1-1/2 hours. Almost all (99.9%) of the lines are rows from a database table. Why is this so slow? Do I have an Ada problem, a GNAT problem, or a MinGW problem? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill @ 2007-04-28 5:20 ` Gautier 2007-04-29 21:17 ` Fionn Mac Cumhaill 2007-04-28 5:25 ` tmoran ` (6 subsequent siblings) 7 siblings, 1 reply; 45+ messages in thread From: Gautier @ 2007-04-28 5:20 UTC (permalink / raw) Fionn Mac Cumhaill: What is the time when you just rewrite the lines (no modification) ? Perhaps there is (also) a bottleneck (like a search in a search) in the modification algorithm. At least you can know how much time it takes. HTH ______________________________________________________________ Gautier -- http://www.mysunrise.ch/users/gdm/index.htm Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm NB: For a direct answer, e-mail address on the Web site! ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:20 ` Gautier @ 2007-04-29 21:17 ` Fionn Mac Cumhaill 0 siblings, 0 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-29 21:17 UTC (permalink / raw) On Sat, 28 Apr 2007 07:20:03 +0200, Gautier <gautier@fakeaddress.nil> wrote: >Fionn Mac Cumhaill: >What is the time when you just rewrite the lines (no modification) ? >Perhaps there is (also) a bottleneck (like a search in a search) in the >modification algorithm. >At least you can know how much time it takes. >HTH >______________________________________________________________ >Gautier -- http://www.mysunrise.ch/users/gdm/index.htm >Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm > I'll try it with no modifications, but I would be surprised if the modification itself takes up much time. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill 2007-04-28 5:20 ` Gautier @ 2007-04-28 5:25 ` tmoran 2007-04-28 6:56 ` Martin Krischik ` (5 subsequent siblings) 7 siblings, 0 replies; 45+ messages in thread From: tmoran @ 2007-04-28 5:25 UTC (permalink / raw) >The problem is that the input file has more than 10 million lines of >text in it. The program works perfectly, but takes about 5 hours to >run. The Cygwin version of wc can count the lines in the input file in Is your problem in Text_IO or in the scanning & "modifications" code? If your program does no scanning and makes no modifications how long does it take? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill 2007-04-28 5:20 ` Gautier 2007-04-28 5:25 ` tmoran @ 2007-04-28 6:56 ` Martin Krischik 2007-04-28 17:12 ` tmoran 2007-04-28 12:41 ` Jeffrey Creem ` (4 subsequent siblings) 7 siblings, 1 reply; 45+ messages in thread From: Martin Krischik @ 2007-04-28 6:56 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > All it does is read lines in a loop from a text file with > Ada.Text_IO.Get_Line, does minor modifications on about 80% of the > lines that it reads, and writes the lines to an output file with > Put_Line. I wrote something like as well. Only I did it genericly as part of the AdaCL [1] library. It could be extended to solve your problem. Now, what I did was to create pipline with three threads - one to read the line, one to make the change, and one to write the lines. Each step of the pipline where connected with a fifo buffer. The idea was that while the system is waiting for new data from the disk it could convert an few lines from the buffer. When I wrote it I use SCSI drives so I could expect better performance from such a setup. But you might still give it a try. I used used pretty modest buffering - in your case I would go for bigger buffers. Martin [1] http://adacl.sf.net -- mailto://krischik@users.sourceforge.net Ada programming at: http://ada.krischik.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 6:56 ` Martin Krischik @ 2007-04-28 17:12 ` tmoran 0 siblings, 0 replies; 45+ messages in thread From: tmoran @ 2007-04-28 17:12 UTC (permalink / raw) >Now, what I did was to create pipline with three threads - one to read the >line, one to make the change, and one to write the lines. Each step of the If the scanning&modification is taking a lot of CPU time and you have a multicore CPU, it might also be worthwhile to have more than one "make the change" tasks. Experiment shows that helps on a simple word counter. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill ` (2 preceding siblings ...) 2007-04-28 6:56 ` Martin Krischik @ 2007-04-28 12:41 ` Jeffrey Creem 2007-04-29 21:35 ` Fionn Mac Cumhaill 2007-04-28 13:22 ` (see below) ` (3 subsequent siblings) 7 siblings, 1 reply; 45+ messages in thread From: Jeffrey Creem @ 2007-04-28 12:41 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > I have written a very simple program that runs on a Windows XP > machine. (Pentium D , I don't remember the clock speed, it is in > middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the > MinGW GNAT 3.4.5 Ada compiler. > > All it does is read lines in a loop from a text file with > Ada.Text_IO.Get_Line, does minor modifications on about 80% of the > lines that it reads, and writes the lines to an output file with > Put_Line. > > The modifications consist of replacing a slice of text at the end of a > line with another bit of text. The biggest slice is 10 characters, and > the replacement slice is always smaller than the original slice. An > occasional line of text is about 6000 characters long, but most are > about 700 haracters. Get_Line reads them into a String variable that > is 10,000 characters long. > > > The problem is that the input file has more than 10 million lines of > text in it. The program works perfectly, but takes about 5 hours to > run. The Cygwin version of wc can count the lines in the input file in > less than one minute. > > The program that produces the file pulls data from a SQL Server 7 > running on an 800 MhZ Pentium III machine with 512 Mb and writes it to > a file in 1-1/2 hours. Almost all (99.9%) of the lines are rows from a > database table. > > Why is this so slow? > Do I have an Ada problem, a GNAT problem, or a MinGW problem? Is there any chance that the code is doing things like initializing the string to blanks/nulls before each getline and or scanning beyond the value returned for last when doing the replace? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 12:41 ` Jeffrey Creem @ 2007-04-29 21:35 ` Fionn Mac Cumhaill 0 siblings, 0 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-29 21:35 UTC (permalink / raw) On Sat, 28 Apr 2007 08:41:26 -0400, Jeffrey Creem <jeff@thecreems.com> wrote: >Fionn Mac Cumhaill wrote: >> I have written a very simple program >> Do I have an Ada problem, a GNAT problem, or a MinGW problem? ... snip ... > >Is there any chance that the code is doing things like initializing the >string to blanks/nulls before each getline and or scanning beyond the >value returned for last when doing the replace? It does indeed initialize the string buffer. I put that in a a quick bug fix, but the version without the initialization still took 5 hours. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill ` (3 preceding siblings ...) 2007-04-28 12:41 ` Jeffrey Creem @ 2007-04-28 13:22 ` (see below) 2007-04-28 17:56 ` Simon Wright ` (2 subsequent siblings) 7 siblings, 0 replies; 45+ messages in thread From: (see below) @ 2007-04-28 13:22 UTC (permalink / raw) On 28/4/07 06:03, in article 0hj5339mjmond132qhbn2o01unurs61lbj@4ax.com, "Fionn Mac Cumhaill" <invisible@hiding.from.spam> wrote: > I have written a very simple program that runs on a Windows XP > machine. (Pentium D , I don't remember the clock speed, it is in > middle of Pentium D clock speeds, 2 Gb memory) It is compiled with the > MinGW GNAT 3.4.5 Ada compiler. > > All it does is read lines in a loop from a text file with > Ada.Text_IO.Get_Line, does minor modifications on about 80% of the > lines that it reads, and writes the lines to an output file with > Put_Line. ... > The problem is that the input file has more than 10 million lines of > text in it. The program works perfectly, but takes about 5 hours to > run. The Cygwin version of wc can count the lines in the input file in > less than one minute. I have an assembler that reads source code and writes an equal number of lines of object code, all I/O done using Ada.Text_IO.{Get_Line, Put(str)}. That takes less than 6 sec CPU time, on a Mac Powerbook 1.67GHz laptop, for 1_000_000 lines of input, giving a performance of around 10 million lines per minute - around 300 times faster than you are getting - so I think it unlikely that Ada.Text_IO is your problem. -- Bill Findlay <surname><forename> chez blueyonder.co.uk ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill ` (4 preceding siblings ...) 2007-04-28 13:22 ` (see below) @ 2007-04-28 17:56 ` Simon Wright 2007-04-28 18:28 ` Jeffrey Creem 2007-04-29 21:42 ` Fionn Mac Cumhaill 2007-04-28 19:12 ` Jeffrey R. Carter 2007-05-02 7:46 ` george 7 siblings, 2 replies; 45+ messages in thread From: Simon Wright @ 2007-04-28 17:56 UTC (permalink / raw) Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: > The modifications consist of replacing a slice of text at the end of > a line with another bit of text. The biggest slice is 10 characters, > and the replacement slice is always smaller than the original > slice. An occasional line of text is about 6000 characters long, but > most are about 700 haracters. Get_Line reads them into a String > variable that is 10,000 characters long. Yesterday I found a performance (and stack!) related problem in GNAT's Ada.Strings.Fixed.Index. The string is mapped onto the stack before the search is done. But I wouldn't have thought this would result in a 5 hour run time. How do you decide which lines need conversion? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 17:56 ` Simon Wright @ 2007-04-28 18:28 ` Jeffrey Creem 2007-04-29 7:20 ` Simon Wright 2007-04-29 21:44 ` Fionn Mac Cumhaill 2007-04-29 21:42 ` Fionn Mac Cumhaill 1 sibling, 2 replies; 45+ messages in thread From: Jeffrey Creem @ 2007-04-28 18:28 UTC (permalink / raw) Simon Wright wrote: > Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: > >> The modifications consist of replacing a slice of text at the end of >> a line with another bit of text. The biggest slice is 10 characters, >> and the replacement slice is always smaller than the original >> slice. An occasional line of text is about 6000 characters long, but >> most are about 700 haracters. Get_Line reads them into a String >> variable that is 10,000 characters long. > > Yesterday I found a performance (and stack!) related problem in GNAT's > Ada.Strings.Fixed.Index. The string is mapped onto the stack before > the search is done. But I wouldn't have thought this would result in a > 5 hour run time. How do you decide which lines need conversion? I agree, it does not seem that either that or the guesses I had related to zeroing at the string (a common mistake) would result in the level of slowdown being seen...However, now that you mention it, I see a few ugly things like that related to Ada.Strings.Fixed. Did you submit a bug report to FSF GCC? As for the original poster, any chance you are doing dynamic memory allocation and leaking and perhaps causing swapping? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 18:28 ` Jeffrey Creem @ 2007-04-29 7:20 ` Simon Wright 2007-04-29 21:44 ` Fionn Mac Cumhaill 1 sibling, 0 replies; 45+ messages in thread From: Simon Wright @ 2007-04-29 7:20 UTC (permalink / raw) Jeffrey Creem <jeff@thecreems.com> writes: > Simon Wright wrote: >> Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: >> >>> The modifications consist of replacing a slice of text at the end of >>> a line with another bit of text. The biggest slice is 10 characters, >>> and the replacement slice is always smaller than the original >>> slice. An occasional line of text is about 6000 characters long, but >>> most are about 700 haracters. Get_Line reads them into a String >>> variable that is 10,000 characters long. >> >> Yesterday I found a performance (and stack!) related problem in GNAT's >> Ada.Strings.Fixed.Index. The string is mapped onto the stack before >> the search is done. But I wouldn't have thought this would result in a >> 5 hour run time. How do you decide which lines need conversion? > > I agree, it does not seem that either that or the guesses I had > related to zeroing at the string (a common mistake) would result in > the level of slowdown being seen...However, now that you mention it, I > see a few ugly things like that related to Ada.Strings.Fixed. Did you > submit a bug report to FSF GCC? No, to AdaCore. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 18:28 ` Jeffrey Creem 2007-04-29 7:20 ` Simon Wright @ 2007-04-29 21:44 ` Fionn Mac Cumhaill 1 sibling, 0 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-29 21:44 UTC (permalink / raw) On Sat, 28 Apr 2007 14:28:56 -0400, Jeffrey Creem <jeff@thecreems.com> wrote: >Simon Wright wrote: >> Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: >> >>> The modifications consist of replacing a slice of text at the end of >>> a line with another bit of text. The biggest slice is 10 characters, >>> and the replacement slice is always smaller than the original >>> slice. An occasional line of text is about 6000 characters long, but >>> most are about 700 haracters. Get_Line reads them into a String >>> variable that is 10,000 characters long. >> >> Yesterday I found a performance (and stack!) related problem in GNAT's >> Ada.Strings.Fixed.Index. The string is mapped onto the stack before >> the search is done. But I wouldn't have thought this would result in a >> 5 hour run time. How do you decide which lines need conversion? > >I agree, it does not seem that either that or the guesses I had related >to zeroing at the string (a common mistake) would result in the level of >slowdown being seen...However, now that you mention it, I see a few ugly >things like that related to Ada.Strings.Fixed. Did you submit a bug >report to FSF GCC? > > >As for the original poster, any chance you are doing dynamic memory >allocation and leaking and perhaps causing swapping? I didn't see any swapfile activity at all. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 17:56 ` Simon Wright 2007-04-28 18:28 ` Jeffrey Creem @ 2007-04-29 21:42 ` Fionn Mac Cumhaill 2007-04-30 0:48 ` Jeffrey R. Carter 1 sibling, 1 reply; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-29 21:42 UTC (permalink / raw) On Sat, 28 Apr 2007 18:56:47 +0100, Simon Wright <simon.j.wright@mac.com> wrote: >Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: > >> The modifications consist of replacing a slice of text at the end of >> a line with another bit of text. The biggest slice is 10 characters, >> and the replacement slice is always smaller than the original >> slice. An occasional line of text is about 6000 characters long, but >> most are about 700 haracters. Get_Line reads them into a String >> variable that is 10,000 characters long. > >Yesterday I found a performance (and stack!) related problem in GNAT's >Ada.Strings.Fixed.Index. The string is mapped onto the stack before >the search is done. But I wouldn't have thought this would result in a >5 hour run time. How do you decide which lines need conversion? I use Index to search for the string that needs to be replaced. if the returned value is not zero, that line needs to be changed. In 99.99% of the changes, I look for ", '')," and replace it with ")," ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-29 21:42 ` Fionn Mac Cumhaill @ 2007-04-30 0:48 ` Jeffrey R. Carter 2007-04-30 2:30 ` Fionn Mac Cumhaill 0 siblings, 1 reply; 45+ messages in thread From: Jeffrey R. Carter @ 2007-04-30 0:48 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > > I use Index to search for the string that needs to be replaced. if the > returned value is not zero, that line needs to be changed. In 99.99% > of the changes, I look for ", '')," and replace it with ")," In a previous post, you said the pattern is found near the end of the string. Can you use the "Going => Backward" option to Index, and if so, do you? -- Jeff Carter "You've got the brain of a four-year-old boy, and I bet he was glad to get rid of it." Horse Feathers 47 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-30 0:48 ` Jeffrey R. Carter @ 2007-04-30 2:30 ` Fionn Mac Cumhaill 2007-04-30 4:21 ` tmoran 0 siblings, 1 reply; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-30 2:30 UTC (permalink / raw) On Mon, 30 Apr 2007 00:48:36 GMT, "Jeffrey R. Carter" <jrcarter@acm.org> wrote: >Fionn Mac Cumhaill wrote: >> >> I use Index to search for the string that needs to be replaced. if the >> returned value is not zero, that line needs to be changed. In 99.99% >> of the changes, I look for ", '')," and replace it with ")," > >In a previous post, you said the pattern is found near the end of the >string. Can you use the "Going => Backward" option to Index, and if so, >do you? I didn't try that, but I will when I get back to work. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-30 2:30 ` Fionn Mac Cumhaill @ 2007-04-30 4:21 ` tmoran 0 siblings, 0 replies; 45+ messages in thread From: tmoran @ 2007-04-30 4:21 UTC (permalink / raw) Using Gnat 3.15p my Windows 2000 3.0GHz machine with an older disk wrote 10K 700 character lines in 0.14 sec. It read them back, doing Ada.Strings.Fixed.Index for "b" (which they did not contain) in 1.5 seconds. That should translate to 1650 seconds or 27.5 minutes to do a read+scan+write on 10M such lines. Skipping the IO and just doing the scan 10K times took 0.4 seconds. Replacing the Ada.Strings.Fixed.Index with a loop searching for 'b' took 0.02 seconds. So, if your program takes 5 hours something seriously unpleasant is going on. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill ` (5 preceding siblings ...) 2007-04-28 17:56 ` Simon Wright @ 2007-04-28 19:12 ` Jeffrey R. Carter 2007-04-29 21:46 ` Fionn Mac Cumhaill 2007-05-02 7:46 ` george 7 siblings, 1 reply; 45+ messages in thread From: Jeffrey R. Carter @ 2007-04-28 19:12 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > > All it does is read lines in a loop from a text file with > Ada.Text_IO.Get_Line, does minor modifications on about 80% of the > lines that it reads, and writes the lines to an output file with > Put_Line. > > The modifications consist of replacing a slice of text at the end of a > line with another bit of text. The biggest slice is 10 characters, and > the replacement slice is always smaller than the original slice. An > occasional line of text is about 6000 characters long, but most are > about 700 haracters. Get_Line reads them into a String variable that > is 10,000 characters long. > > The problem is that the input file has more than 10 million lines of > text in it. The program works perfectly, but takes about 5 hours to > run. The Cygwin version of wc can count the lines in the input file in > less than one minute. > > Why is this so slow? > Do I have an Ada problem, a GNAT problem, or a MinGW problem? It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, but I'd be surprised if it made that much difference. This sounds as if it would be a pretty simple program; if so, you could post it (or a reasonable facsimile, if there's a reason you can't post the actual code) here, and we would be better able to help. -- Jeff Carter "Nobody expects the Spanish Inquisition!" Monty Python's Flying Circus 22 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 19:12 ` Jeffrey R. Carter @ 2007-04-29 21:46 ` Fionn Mac Cumhaill 2007-05-01 14:10 ` Fionn Mac Cumhaill 2007-05-06 21:55 ` Quarc 0 siblings, 2 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-04-29 21:46 UTC (permalink / raw) On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter" <jrcarter@acm.org> wrote: >Fionn Mac Cumhaill wrote: >> >> All it does is read lines in a loop from a text file with >> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the >> lines that it reads, and writes the lines to an output file with >> Put_Line. >> >> The modifications consist of replacing a slice of text at the end of a >> line with another bit of text. The biggest slice is 10 characters, and >> the replacement slice is always smaller than the original slice. An >> occasional line of text is about 6000 characters long, but most are >> about 700 haracters. Get_Line reads them into a String variable that >> is 10,000 characters long. >> >> The problem is that the input file has more than 10 million lines of >> text in it. The program works perfectly, but takes about 5 hours to >> run. The Cygwin version of wc can count the lines in the input file in >> less than one minute. >> >> Why is this so slow? >> Do I have an Ada problem, a GNAT problem, or a MinGW problem? > >It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, >but I'd be surprised if it made that much difference. > >This sounds as if it would be a pretty simple program; if so, you could >post it (or a reasonable facsimile, if there's a reason you can't post >the actual code) here, and we would be better able to help. I'm posting to cla from home. The program is at work. I'll post the code when I get back to work. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-29 21:46 ` Fionn Mac Cumhaill @ 2007-05-01 14:10 ` Fionn Mac Cumhaill 2007-05-06 21:55 ` Quarc 1 sibling, 0 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-05-01 14:10 UTC (permalink / raw) On Sun, 29 Apr 2007 21:46:29 GMT, Fionn Mac Cumhaill <invisible@hiding.from.spam> wrote: >On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter" ><jrcarter@acm.org> wrote: > >>Fionn Mac Cumhaill wrote: >>> >>> All it does is read lines in a loop from a text file with >>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the >>> lines that it reads, and writes the lines to an output file with >>> Put_Line. >>> >>> The modifications consist of replacing a slice of text at the end of a >>> line with another bit of text. The biggest slice is 10 characters, and >>> the replacement slice is always smaller than the original slice. An >>> occasional line of text is about 6000 characters long, but most are >>> about 700 haracters. Get_Line reads them into a String variable that >>> is 10,000 characters long. >>> >>> The problem is that the input file has more than 10 million lines of >>> text in it. The program works perfectly, but takes about 5 hours to >>> run. The Cygwin version of wc can count the lines in the input file in >>> less than one minute. >>> >>> Why is this so slow? >>> Do I have an Ada problem, a GNAT problem, or a MinGW problem? >> >>It's hard to tell. Text_IO is quite heavy-weight compared to C text I/O, >>but I'd be surprised if it made that much difference. >> >>This sounds as if it would be a pretty simple program; if so, you could >>post it (or a reasonable facsimile, if there's a reason you can't post >>the actual code) here, and we would be better able to help. > >I'm posting to cla from home. The program is at work. I'll post the >code when I get back to work. Well, I was too pressed for time to do any one-change-at-a-time experimentation; I removed the buffer initialization, and changed the Index searches to use Backward. I removed some other uses of Index, as the slice for which I was looking was at a fixed location. Run time fell to 8 minutes. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-29 21:46 ` Fionn Mac Cumhaill 2007-05-01 14:10 ` Fionn Mac Cumhaill @ 2007-05-06 21:55 ` Quarc 1 sibling, 0 replies; 45+ messages in thread From: Quarc @ 2007-05-06 21:55 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > On Sat, 28 Apr 2007 19:12:33 GMT, "Jeffrey R. Carter" > <jrcarter@acm.org> wrote: > >> Fionn Mac Cumhaill wrote: >>> All it does is read lines in a loop from a text file with >>> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the >>> lines that it reads, and writes the lines to an output file with >>> Put_Line. >>> >>> The modifications consist of replacing a slice of text at the end of a >>> line with another bit of text. The biggest slice is 10 characters, and >>> the replacement slice is always smaller than the original slice. An >>> occasional line of text is about 6000 characters long, but most are >>> about 700 haracters. Get_Line reads them into a String variable that >>> is 10,000 characters long. >>> >>> The problem is that the input file has more than 10 million lines of >>> text in it. The program works perfectly, but takes about 5 hours to >>> run. The Cygwin version of wc can count the lines in the input file in >>> less than one minute. >>> >>> Why is this so slow? >>> Do I have an Ada problem, a GNAT problem, or a MinGW problem? Sorry, but I don't have your full question here so trying to figure out your problem for the quote above. I woul assume the big difference between what you are doing and what WC does is that wc only reads the file, and can therefore do it in one go. I am speculation now, but I think one explanation for this would be that you are reading, then writing each line by itself. Depending on the OS you are running on, and also the binding to the OS this could potentially create a loot of seek times on the disk. This could explain the long times I think. One of two things happen I think (or potentially both). 1) You read 700 bytes from disk, you write 700 bytes to disk. Assume average seek times of 5 ms and you get to 2*50 000 s = 100 000 s = 28 hours !!!! 2) Instead you creating the seektimes on the disk between reads, ther are other processes waiting for disk access in the OS as well. SO qite often some other process get to the top of the waiting queue, and can access disk, causing the seektimes. If one of the above is causing your problem, the only solution is to read in larger chunks at once, and then write larger chunks of data as well. I don't remember how to set that up, and it is of course depending on compiler and OS, but hopefully you should be able to make sure you don't read line by line from disk. regards Peter Atterfj�ll (haven't been working professionally with Ada for 14 years now :-) ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill ` (6 preceding siblings ...) 2007-04-28 19:12 ` Jeffrey R. Carter @ 2007-05-02 7:46 ` george 2007-05-03 6:31 ` Fionn Mac Cumhaill 7 siblings, 1 reply; 45+ messages in thread From: george @ 2007-05-02 7:46 UTC (permalink / raw) Fionn Mac Cumhaill wrote: > All it does is read lines in a loop from a text file with > Ada.Text_IO.Get_Line, does minor modifications on about 80% of the > lines that it reads, and writes the lines to an output file with > Put_Line. Looks like a perfect task for sed utility to me. Even if sed does not do all you need you can at the very least check some simple sed transformation on your data, which should give you an estimate of what run times you can expect with your file. I am not familiar with MinGW environment. If it is anything like CygWin then you should be able to run GNU tools and perform that test with sed. At least this should answer the > Do I have an Ada problem, a GNAT problem, or a MinGW problem? (and CygWin does tend to be slower than native computation, however I used it only a few times). George ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-02 7:46 ` george @ 2007-05-03 6:31 ` Fionn Mac Cumhaill 2007-05-03 20:00 ` Simon Wright 2007-05-03 20:27 ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan 0 siblings, 2 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-05-03 6:31 UTC (permalink / raw) On 2 May 2007 00:46:07 -0700, george@gentoo.org wrote: >Fionn Mac Cumhaill wrote: >> All it does is read lines in a loop from a text file with >> Ada.Text_IO.Get_Line, does minor modifications on about 80% of the >> lines that it reads, and writes the lines to an output file with >> Put_Line. >Looks like a perfect task for sed utility to me. Even if sed does not >do all you need you can at the very least check some simple sed >transformation on your data, which should give you an estimate of what >run times you can expect with your file. >I am not familiar with MinGW environment. If it is anything like >CygWin then you should be able to run GNU tools and perform that test >with sed. At least this should answer the > >> Do I have an Ada problem, a GNAT problem, or a MinGW problem? > >(and CygWin does tend to be slower than native computation, however I >used it only a few times). > > >George I considered sed, but I know next to nothing about it, and decided that by the time I learned enough to use it, I could have done the job in Ada several times over. I did do further investigation; I made a copy of the now-working program and threw most of the program away, leaving only a very simple program which read the large input file, but made no changes and did no output. I added code to track the run time and put the buffer clear back in. It read the 10 million lines in just a little over five minutes. I then put Index back and used it to search the buffer for a short string that would never be found, seatching forwards from the beginning of the input buffer. Bingo. Run time increased to a bit more than 1-1/2 hours. One of the earlier commenters observed that there was a bug in GNAT's Index. It will be interesting to try this again once the bug is fixed. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 6:31 ` Fionn Mac Cumhaill @ 2007-05-03 20:00 ` Simon Wright 2007-05-04 4:35 ` Jeffrey R. Carter ` (2 more replies) 2007-05-03 20:27 ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan 1 sibling, 3 replies; 45+ messages in thread From: Simon Wright @ 2007-05-03 20:00 UTC (permalink / raw) Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: > One of the earlier commenters observed that there was a bug in > GNAT's Index. It will be interesting to try this again once the bug > is fixed. That was me. I could give you my version of Index if you want .. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 20:00 ` Simon Wright @ 2007-05-04 4:35 ` Jeffrey R. Carter 2007-05-04 4:45 ` Fionn Mac Cumhaill 2007-05-04 6:53 ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen 2 siblings, 0 replies; 45+ messages in thread From: Jeffrey R. Carter @ 2007-05-04 4:35 UTC (permalink / raw) Simon Wright wrote: > > That was me. I could give you my version of Index if you want .. Or he could try PragmARC.Quick_Searcher. The algorithm it implements is supposed to be faster than Boyer-Moore "for most reasonable search patterns". -- Jeff Carter "Mr. President, we must not allow a mine-shaft gap!" Dr. Strangelove 33 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 20:00 ` Simon Wright 2007-05-04 4:35 ` Jeffrey R. Carter @ 2007-05-04 4:45 ` Fionn Mac Cumhaill 2007-05-04 6:53 ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen 2 siblings, 0 replies; 45+ messages in thread From: Fionn Mac Cumhaill @ 2007-05-04 4:45 UTC (permalink / raw) On Thu, 03 May 2007 21:00:25 +0100, Simon Wright <simon.j.wright@mac.com> wrote: >Fionn Mac Cumhaill <invisible@hiding.from.spam> writes: > >> One of the earlier commenters observed that there was a bug in >> GNAT's Index. It will be interesting to try this again once the bug >> is fixed. > >That was me. I could give you my version of Index if you want .. I would indeed like to see it. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) 2007-05-03 20:00 ` Simon Wright 2007-05-04 4:35 ` Jeffrey R. Carter 2007-05-04 4:45 ` Fionn Mac Cumhaill @ 2007-05-04 6:53 ` Jacob Sparre Andersen 2007-05-04 7:41 ` Dmitry A. Kazakov 2 siblings, 1 reply; 45+ messages in thread From: Jacob Sparre Andersen @ 2007-05-04 6:53 UTC (permalink / raw) Simon Wright <simon.j.wright@mac.com> wrote: > That was me. I could give you my version of Index if you want .. I would like to have a copy as well. My demonstration of how you can make a fast implementation of the Unix command "tail" with POSIX.Memory_Mapping was not very successful, and I suspect that the sinner may be Index. Greetings, Jacob -- Ada - the programming language still ahead of its time. -- tail1.adb - show the last ten lines from a named text file with Ada.Characters.Latin_1; with Ada.Command_Line; with Ada.Strings.Fixed; with Ada.Text_IO; with POSIX; with POSIX.IO; with POSIX.Memory_Mapping; with System; with System.Storage_Elements; procedure Tail is use POSIX; use POSIX.IO; use POSIX.Memory_Mapping; use System.Storage_Elements; Text_File : File_Descriptor; Text_Size : System.Storage_Elements.Storage_Offset; Text_Address : System.Address; begin Text_File := Open (Name => To_POSIX_String (Ada.Command_Line.Argument (1)), Mode => Read_Only); Text_Size := Storage_Offset (File_Size (Text_File)); Text_Address := Map_Memory (Length => Text_Size, Protection => Allow_Read, Mapping => Map_Shared, File => Text_File, Offset => 0); declare use Ada.Strings; use Ada.Strings.Fixed; package Latin_1 renames Ada.Characters.Latin_1; Bit_Count : constant Natural := Natural (Text_Size) * Storage_Element'Size; Character_Count : constant Natural := Bit_Count / Character'Size; Text : String (1 .. Character_Count); for Text'Address use Text_Address; Last : Natural := Text'Last + 1; begin for Lines in 1 .. 10 loop Last := Index (Source => Text (Text'First .. Last - 1), Pattern => (1 => Latin_1.LF), Going => Backward); exit when Last = 0; end loop; declare POSIX_Text : POSIX_String (Text'Range); for POSIX_Text'Address use Text'Address; begin while Last < Text'Last loop NONSTANDARD_Write (File => Standard_Output, Buffer => POSIX_Text (Last + 1 .. Text'Last), Last => Last); end loop; end; end; Unmap_Memory (First => Text_Address, Length => Text_Size); Close (File => Text_File); end Tail; ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) 2007-05-04 6:53 ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen @ 2007-05-04 7:41 ` Dmitry A. Kazakov 2007-05-04 9:16 ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen 0 siblings, 1 reply; 45+ messages in thread From: Dmitry A. Kazakov @ 2007-05-04 7:41 UTC (permalink / raw) On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: > My demonstration of how you can make a fast implementation of the Unix > command "tail" with POSIX.Memory_Mapping was not very successful, and > I suspect that the sinner may be Index. [...] > for Lines in 1 .. 10 loop > Last := Index (Source => Text (Text'First .. Last - 1), Did you check if GNAT did not copy the string slice before calling to Index? > Pattern => (1 => Latin_1.LF), > Going => Backward); > exit when Last = 0; > end loop; -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 45+ messages in thread
* Copying string slices before calling subroutines? (Was: Alternative Index implementation?) 2007-05-04 7:41 ` Dmitry A. Kazakov @ 2007-05-04 9:16 ` Jacob Sparre Andersen 2007-05-04 9:44 ` Copying string slices before calling subroutines? Jacob Sparre Andersen 0 siblings, 1 reply; 45+ messages in thread From: Jacob Sparre Andersen @ 2007-05-04 9:16 UTC (permalink / raw) Dmitry A. Kazakov wrote: > On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: >> for Lines in 1 .. 10 loop >> Last := Index (Source => Text (Text'First .. Last - 1), > > Did you check if GNAT did not copy the string slice before calling to > Index? No. Do you have any suggestion for a way to check it. The reference manual (section 6.2) seems to indicate that it is up to the compiler to decide. Greetings, Jacob -- Beware of people with Gnus, they may Hurd you. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 9:16 ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen @ 2007-05-04 9:44 ` Jacob Sparre Andersen 2007-05-04 10:14 ` Dmitry A. Kazakov 0 siblings, 1 reply; 45+ messages in thread From: Jacob Sparre Andersen @ 2007-05-04 9:44 UTC (permalink / raw) Jacob Sparre Andersen wrote: > Dmitry A. Kazakov wrote: >> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: >>> for Lines in 1 .. 10 loop >>> Last := Index (Source => Text (Text'First .. Last - 1), >> >> Did you check if GNAT did not copy the string slice before calling to >> Index? > > No. Do you have any suggestion for a way to check it. The > reference manual (section 6.2) seems to indicate that it is up to > the compiler to decide. I have found a strong indication that GNAT copies the whole string slice onto the stack before calling Index: It is limited how large files I can run my Tail program on, before I get a segmentation fault. The limit is roughly the allowed stack size on the system (and changes as I change the allowed stack size). This version appears to be as fast as the GNU Tail program. The only thing which worries me a bit is that I see roughly twice as many "minor pagefaults" with my own version as with the GNU version. Greetings, Jacob PS: I don't feel like cheating and looking at the GNU source code yet. -- "Then, after a second or so, nothing continued to happen." ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 9:44 ` Copying string slices before calling subroutines? Jacob Sparre Andersen @ 2007-05-04 10:14 ` Dmitry A. Kazakov 2007-05-04 12:07 ` Jeffrey Creem 0 siblings, 1 reply; 45+ messages in thread From: Dmitry A. Kazakov @ 2007-05-04 10:14 UTC (permalink / raw) On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote: > Jacob Sparre Andersen wrote: >> Dmitry A. Kazakov wrote: >>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: > >>>> for Lines in 1 .. 10 loop >>>> Last := Index (Source => Text (Text'First .. Last - 1), >>> >>> Did you check if GNAT did not copy the string slice before calling to >>> Index? >> >> No. Do you have any suggestion for a way to check it. The >> reference manual (section 6.2) seems to indicate that it is up to >> the compiler to decide. > > I have found a strong indication that GNAT copies the whole string > slice onto the stack before calling Index: So, Index was innocent. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 10:14 ` Dmitry A. Kazakov @ 2007-05-04 12:07 ` Jeffrey Creem 2007-05-04 12:46 ` Dmitry A. Kazakov 2007-05-04 22:27 ` Simon Wright 0 siblings, 2 replies; 45+ messages in thread From: Jeffrey Creem @ 2007-05-04 12:07 UTC (permalink / raw) Dmitry A. Kazakov wrote: > On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote: > >> Jacob Sparre Andersen wrote: >>> Dmitry A. Kazakov wrote: >>>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: >>>>> for Lines in 1 .. 10 loop >>>>> Last := Index (Source => Text (Text'First .. Last - 1), >>>> Did you check if GNAT did not copy the string slice before calling to >>>> Index? >>> No. Do you have any suggestion for a way to check it. The >>> reference manual (section 6.2) seems to indicate that it is up to >>> the compiler to decide. >> I have found a strong indication that GNAT copies the whole string >> slice onto the stack before calling Index: > > So, Index was innocent. > I don't know that I'd say that...Though it might be innocent in this case, a quick look at the source code (which then remaps index to a common string search package) looks to me like there are other opportunities where some versions of the Index call will make another copy in cases when it need not and further this does not appear to match the comments in the package. Specifically, in the trunk of the current FSF Ada in a-strsea.adb at the top of the package it says (formatting adjusted on this comment) -- Note: This code is derived from the ADAR.CSH public domain Ada 83 -- versions of the Appendix C string handling packages (code extracted -- from Ada.Strings.Fixed). A significant change is that we optimize -- the case of identity mappings for Count and Index, and also -- Index_Non_Blank is specialized (rather than using the -- general Index routine). Great! Sounds promising...Lets take a look at one of them function Index (Source : String; Pattern : String; Going : Direction := Forward; Mapping : Maps.Character_Mapping := Maps.Identity) return Natural is Cur_Index : Natural; Mapped_Source : String (Source'Range); begin if Pattern = "" then raise Pattern_Error; end if; for J in Source'Range loop Mapped_Source (J) := Value (Mapping, Source (J)); end loop; -- Forwards case if Going = Forward then for J in 1 .. Source'Length - Pattern'Length + 1 loop Cur_Index := Source'First + J - 1; if Pattern = Mapped_Source (Cur_Index .. Cur_Index + Pattern'Length - 1) then return Cur_Index; end if; end loop; -- Backwards case else for J in reverse 1 .. Source'Length - Pattern'Length + 1 loop Cur_Index := Source'First + J - 1; if Pattern = Mapped_Source (Cur_Index .. Cur_Index + Pattern'Length - 1) then return Cur_Index; end if; end loop; end if; -- Fall through if no match found. Note that the loops are skipped -- completely in the case of the pattern being longer than the source. return 0; end Index; Hmm...I don't see the optimization for identity mappings anyplace... You can probably stop reading here as the formatting is ugly but the license for the above code snippet is: -- Copyright (C) 1992-2005, Free Software Foundation, Inc. -- -- -- -- GNAT is free software; you can redistribute it and/or modify it under -- -- terms of the GNU General Public License as published by the Free Soft- -- -- ware Foundation; either version 2, or (at your option) any later ver- -- -- sion. GNAT is distributed in the hope that it will be useful, but WITH- -- -- OUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -- -- or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License -- -- for more details. You should have received a copy of the GNU General -- -- Public License distributed with GNAT; see file COPYING. If not, write -- -- to the Free Software Foundation, 51 Franklin Street, Fifth Floor, -- -- Boston, MA 02110-1301, USA. -- -- -- -- As a special exception, if other files instantiate generics from this -- -- unit, or you link this unit with other files to produce an executable, -- -- this unit does not by itself cause the resulting executable to be -- -- covered by the GNU General Public License. This exception does not -- -- however invalidate any other reasons why the executable file might be -- -- covered by the GNU Public License. -- -- -- -- GNAT was originally developed by the GNAT team at New York University. -- -- Extensive contributions were provided by Ada Core Technologies Inc. -- ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 12:07 ` Jeffrey Creem @ 2007-05-04 12:46 ` Dmitry A. Kazakov 2007-05-04 22:27 ` Simon Wright 1 sibling, 0 replies; 45+ messages in thread From: Dmitry A. Kazakov @ 2007-05-04 12:46 UTC (permalink / raw) On Fri, 04 May 2007 08:07:59 -0400, Jeffrey Creem wrote: > Dmitry A. Kazakov wrote: >> On Fri, 04 May 2007 11:44:14 +0200, Jacob Sparre Andersen wrote: >> >>> Jacob Sparre Andersen wrote: >>>> Dmitry A. Kazakov wrote: >>>>> On Fri, 04 May 2007 08:53:59 +0200, Jacob Sparre Andersen wrote: >>>>>> for Lines in 1 .. 10 loop >>>>>> Last := Index (Source => Text (Text'First .. Last - 1), >>>>> Did you check if GNAT did not copy the string slice before calling to >>>>> Index? >>>> No. Do you have any suggestion for a way to check it. The >>>> reference manual (section 6.2) seems to indicate that it is up to >>>> the compiler to decide. >>> I have found a strong indication that GNAT copies the whole string >>> slice onto the stack before calling Index: >> >> So, Index was innocent. > > I don't know that I'd say that... Though it might be innocent in this > case, a quick look at the source code (which then remaps index to a > common string search package) looks to me like there are other > opportunities where some versions of the Index call will make another > copy in cases when it need not and further this does not appear to match > the comments in the package. Devastating. I take my words back. > Hmm...I don't see the optimization for identity mappings anyplace... It looks that they call "optimization" copying Source into Mapped_Source with remapping. Probably they hoped that comparing Pattern with a remapping slice could translate into a processor-specific string comparison instruction. Anyway... -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 12:07 ` Jeffrey Creem 2007-05-04 12:46 ` Dmitry A. Kazakov @ 2007-05-04 22:27 ` Simon Wright 2007-05-05 7:33 ` Jacob Sparre Andersen 2007-05-05 7:41 ` Dmitry A. Kazakov 1 sibling, 2 replies; 45+ messages in thread From: Simon Wright @ 2007-05-04 22:27 UTC (permalink / raw) Jeffrey Creem <jeff@thecreems.com> writes: > function Index > (Source : String; > Pattern : String; > Going : Direction := Forward; > Mapping : Maps.Character_Mapping := Maps.Identity) return Natural > is > Cur_Index : Natural; > Mapped_Source : String (Source'Range); There's the problem that I had (in my case Source'Length was > 2_000_000!) calling Index from one of my tasks. My reworking begins function Index (Source : String; Pattern : String; Going : Ada.Strings.Direction := Ada.Strings.Forward; Mapping : Ada.Strings.Maps.Character_Mapping := Ada.Strings.Maps.Identity) return Natural is Cur_Index : Natural; Potential_Match : Boolean; use Ada.Strings; use Ada.Strings.Maps; begin if Pattern = "" then raise Pattern_Error; end if; -- Forwards case if Going = Forward then for J in 1 .. Source'Length - Pattern'Length + 1 loop Cur_Index := Source'First + J - 1; Potential_Match := True; for K in Pattern'Range loop if Pattern (K) /= Value (Mapping, Source (Cur_Index + K - 1)) then Potential_Match := False; exit; end if; end loop; if Potential_Match then return Cur_Index; end if; end loop; which calls Ada.Strings.Maps.Value rather more often than I suppose it could. But since the normal case is for the mapping to be the identity mapping the most useful optimisation would be to check for Identity and just compare directly. Someone else was asking whether GNAT copies on the stack before the call -- I see no evidence of that. I had to provide my own implementation of this Ada05 routine because my code has to be Ada95, and it shows no sign of excessive stack use: function Index (Source : String; Pattern : String; From : Positive; Going : Ada.Strings.Direction := Ada.Strings.Forward; Mapping : Ada.Strings.Maps.Character_Mapping := Ada.Strings.Maps.Identity) return Natural is Candidate : String renames Source (From .. Source'Last); begin return Index (Source => Candidate, Pattern => Pattern, Going => Going, Mapping => Mapping); end Index; ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 22:27 ` Simon Wright @ 2007-05-05 7:33 ` Jacob Sparre Andersen 2007-05-05 7:47 ` Dmitry A. Kazakov 2007-05-05 7:41 ` Dmitry A. Kazakov 1 sibling, 1 reply; 45+ messages in thread From: Jacob Sparre Andersen @ 2007-05-05 7:33 UTC (permalink / raw) Simon Wright <simon.j.wright@mac.com> writes: > Jeffrey Creem <jeff@thecreems.com> writes: >> function Index >> (Source : String; >> Pattern : String; >> Going : Direction := Forward; >> Mapping : Maps.Character_Mapping := Maps.Identity) return Natural >> is >> Cur_Index : Natural; >> Mapped_Source : String (Source'Range); > > There's the problem that I had (in my case Source'Length was > > 2_000_000!) calling Index from one of my tasks. > > My reworking begins [...] > which calls Ada.Strings.Maps.Value rather more often than I suppose > it could. But since the normal case is for the mapping to be the > identity mapping the most useful optimisation would be to check for > Identity and just compare directly. Wouldn't the proper solution be to separate out Mapping = Maps.Identity and Mapping /= Maps.Identity? Even though we can do cool things with default values for arguments, it isn't always efficient. function Index (Source : String; Pattern : String; Going : Direction := Forward) return Natural; function Index (Source : String; Pattern : String; Going : Direction := Forward; Mapping : Maps.Character_Mapping) return Natural; I can't see that there can be any cases, where existing code will work differently with these two specifications instead of the existing single specification. We should also remember that for a compiler writer team, Mapping = Maps.Identity may not be the normal case. > Someone else was asking whether GNAT copies on the stack before the > call -- I see no evidence of that. Now that the implementation of Index has been pointed out to me, I agree. I could just see that it happened at some point during the call to Index. Greetings, Jacob -- �For there are only two reasons why war is made against a republic: The one, to become lord over her: the other, the fear of being occupied by her.� -- Nicolo Machiavelli ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-05 7:33 ` Jacob Sparre Andersen @ 2007-05-05 7:47 ` Dmitry A. Kazakov 0 siblings, 0 replies; 45+ messages in thread From: Dmitry A. Kazakov @ 2007-05-05 7:47 UTC (permalink / raw) On Sat, 05 May 2007 09:33:26 +0200, Jacob Sparre Andersen wrote: > We should also remember that for a compiler writer team, Mapping = > Maps.Identity may not be the normal case. An alternative could be to change A.4.3. towards two Index functions instead of one: function Index (Source : in String; Pattern : in String; Going : in Direction := Forward; Mapping : in Maps.Character_Mapping) return Natural; function Index (Source : in String; Pattern : in String; Going : in Direction := Forward) return Natural; -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Copying string slices before calling subroutines? 2007-05-04 22:27 ` Simon Wright 2007-05-05 7:33 ` Jacob Sparre Andersen @ 2007-05-05 7:41 ` Dmitry A. Kazakov 1 sibling, 0 replies; 45+ messages in thread From: Dmitry A. Kazakov @ 2007-05-05 7:41 UTC (permalink / raw) On Fri, 04 May 2007 23:27:57 +0100, Simon Wright wrote: > My reworking begins > > function Index > (Source : String; > Pattern : String; > Going : Ada.Strings.Direction := Ada.Strings.Forward; > Mapping : Ada.Strings.Maps.Character_Mapping > := Ada.Strings.Maps.Identity) return Natural > is > Cur_Index : Natural; > Potential_Match : Boolean; > use Ada.Strings; > use Ada.Strings.Maps; > begin > if Pattern = "" then > raise Pattern_Error; > end if; > > -- Forwards case > > if Going = Forward then > for J in 1 .. Source'Length - Pattern'Length + 1 loop > Cur_Index := Source'First + J - 1; > Potential_Match := True; > for K in Pattern'Range loop > if Pattern (K) /= > Value (Mapping, Source (Cur_Index + K - 1)) then > Potential_Match := False; > exit; > end if; > end loop; > if Potential_Match then > return Cur_Index; > end if; > end loop; > > which calls Ada.Strings.Maps.Value rather more often than I suppose it > could. You can save remapping of Source, just there is no need to do it in advance and allocate a full copy of Source. Instead of that, you make a ring buffer of mod Pattern'Length where you store Value (Mapping, Source (i)). Once you advance the main string index you "rotate" the buffer. Then Pattern'Length = 1 should be a special case, as well as identity mapping. Though, I don't know how efficient the latter could be. -- Regards, Dmitry A. Kazakov http://www.dmitry-kazakov.de ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 6:31 ` Fionn Mac Cumhaill 2007-05-03 20:00 ` Simon Wright @ 2007-05-03 20:27 ` Adam Beneschan 2007-05-03 23:01 ` Randy Brukardt 1 sibling, 1 reply; 45+ messages in thread From: Adam Beneschan @ 2007-05-03 20:27 UTC (permalink / raw) On May 2, 11:31 pm, Fionn Mac Cumhaill <invisi...@hiding.from.spam> wrote: > I did do further investigation; I made a copy of the now-working > program and threw most of the program away, leaving only a very simple > program which read the large input file, but made no changes and did > no output. I added code to track the run time and put the buffer clear > back in. It read the 10 million lines in just a little over five > minutes. I then put Index back and used it to search the buffer for a > short string that would never be found, seatching forwards from the > beginning of the input buffer. Bingo. Run time increased to a bit more > than 1-1/2 hours. It strikes me that Index is the kind of function that really ought to be written in assembly language, at least partially. I notice that the version of Linux that I'm using has a built-in function to search memory for a substring; this is very descriptively called memmem() and has the amusing profile void *memmem (const void *needle, size_t needlelen, const void *haystack, size_t haystacklen); according to the man page. But I assume this is written to use registers optimally and take advantage of the REP instructions (on an x86 or Pentium). I don't know how GNAT implements Index---I haven't looked into it. -- Adam ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 20:27 ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan @ 2007-05-03 23:01 ` Randy Brukardt 2007-05-04 0:28 ` Markus E Leypold 2007-05-04 20:04 ` Adam Beneschan 0 siblings, 2 replies; 45+ messages in thread From: Randy Brukardt @ 2007-05-03 23:01 UTC (permalink / raw) "Adam Beneschan" <adam@irvine.com> wrote in message news:1178224048.034635.39010@l77g2000hsb.googlegroups.com... ... > It strikes me that Index is the kind of function that really ought to > be written in assembly language, at least partially. I notice that > the version of Linux that I'm using has a built-in function to search > memory for a substring; this is very descriptively called memmem() and > has the amusing profile > > void *memmem (const void *needle, size_t needlelen, > const void *haystack, size_t haystacklen); > > according to the man page. But I assume this is written to use > registers optimally and take advantage of the REP instructions (on an > x86 or Pentium). I don't know how GNAT implements Index---I haven't > looked into it. The big expense in Index is the mapping set or function, not the actual compare. For Janus/Ada, I had seen a similar problem (a big deal as Index was used to look for spam patterns), and finally special-cased a number of common cases (no mapping, single character patterns, and so on). I also spent a bit of time on the code generator, figuring that this sort of string manipulation code is common enough that it might as well be generated well. The updates helped a lot, although they don't quite generate a single instruction such as is possible. (OTOH, Intel used to recommend avoiding the block move and compare instructions because they fouled up the pipeline and thus slowed the overall execution. I don't know if that is still true, but ifi it is, there might be less benefit to hand-coded assembler than you are thinking...) Randy. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 23:01 ` Randy Brukardt @ 2007-05-04 0:28 ` Markus E Leypold 2007-05-05 16:26 ` Adam Beneschan 2007-05-04 20:04 ` Adam Beneschan 1 sibling, 1 reply; 45+ messages in thread From: Markus E Leypold @ 2007-05-04 0:28 UTC (permalink / raw) "Randy Brukardt" <randy@rrsoftware.com> writes: > "Adam Beneschan" <adam@irvine.com> wrote in message > news:1178224048.034635.39010@l77g2000hsb.googlegroups.com... > ... >> It strikes me that Index is the kind of function that really ought to >> be written in assembly language, at least partially. I notice that >> the version of Linux that I'm using has a built-in function to search >> memory for a substring; this is very descriptively called memmem() and >> has the amusing profile >> >> void *memmem (const void *needle, size_t needlelen, >> const void *haystack, size_t haystacklen); >> >> according to the man page. But I assume this is written to use >> registers optimally and take advantage of the REP instructions (on an >> x86 or Pentium). I don't know how GNAT implements Index---I haven't >> looked into it. > > The big expense in Index is the mapping set or function, not the actual > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index > was used to look for spam patterns), and finally special-cased a number of > common cases (no mapping, single character patterns, and so on). I also > spent a bit of time on the code generator, figuring that this sort of string > manipulation code is common enough that it might as well be generated well. > The updates helped a lot, although they don't quite generate a single > instruction such as is possible. (OTOH, Intel used to recommend avoiding the > block move and compare instructions because they fouled up the pipeline and > thus slowed the overall execution. I don't know if that is still true, but > ifi it is, there might be less benefit to hand-coded assembler than you are > thinking...) I'd like to add, that some years ago I read a rather well written paper about this kind of text searches. Just repeating comparisons in two loops (one to move the offset in haystack, the other to compare need against haystack+offset) is actually not the best method. I.e. if one fails when comparing this haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd needle: adasdqwsdfklsdf ^ comparison fails one can immediately increase the offset by 7, since the needle does not contain a '3' in the part that is before the point where the comparison failed. The technique I read about, used considerations of this kind to compile the 'needle' into some automaton which could skip more than 1 character at a time in haystack and then executed / interpreted this automaton. (Actually the algorithm was a bit more tricky than this, I think: If the automaton sees a character that is not in needle at all, it can immediately skip for the length of needle and I think part of the trick was to start comparing from the end of needle). So in this case a better algorithm is probably the way to go (and I agree, asembler and REP, which would just be the naive algorithm with 2 loops, won't cut it, at least I'd not bet on it) Regards -- Markus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-04 0:28 ` Markus E Leypold @ 2007-05-05 16:26 ` Adam Beneschan 2007-05-05 17:27 ` Markus E Leypold 0 siblings, 1 reply; 45+ messages in thread From: Adam Beneschan @ 2007-05-05 16:26 UTC (permalink / raw) On May 3, 5:28 pm, Markus E Leypold <development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote: > "Randy Brukardt" <r...@rrsoftware.com> writes: > > "Adam Beneschan" <a...@irvine.com> wrote in message > >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com... > > ... > >> It strikes me that Index is the kind of function that really ought to > >> be written in assembly language, at least partially. I notice that > >> the version of Linux that I'm using has a built-in function to search > >> memory for a substring; this is very descriptively called memmem() and > >> has the amusing profile > > >> void *memmem (const void *needle, size_t needlelen, > >> const void *haystack, size_t haystacklen); > > >> according to the man page. But I assume this is written to use > >> registers optimally and take advantage of the REP instructions (on an > >> x86 or Pentium). I don't know how GNAT implements Index---I haven't > >> looked into it. > > > The big expense in Index is the mapping set or function, not the actual > > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index > > was used to look for spam patterns), and finally special-cased a number of > > common cases (no mapping, single character patterns, and so on). I also > > spent a bit of time on the code generator, figuring that this sort of string > > manipulation code is common enough that it might as well be generated well. > > The updates helped a lot, although they don't quite generate a single > > instruction such as is possible. (OTOH, Intel used to recommend avoiding the > > block move and compare instructions because they fouled up the pipeline and > > thus slowed the overall execution. I don't know if that is still true, but > > ifi it is, there might be less benefit to hand-coded assembler than you are > > thinking...) > > I'd like to add, that some years ago I read a rather well written > paper about this kind of text searches. Just repeating comparisons in > two loops (one to move the offset in haystack, the other to compare > need against haystack+offset) is actually not the best method. I.e. if > one fails when comparing this > > haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd > needle: adasdqwsdfklsdf > ^ > comparison fails > > one can immediately increase the offset by 7, since the needle does > not contain a '3' in the part that is before the point where the > comparison failed. The technique I read about, used considerations of > this kind to compile the 'needle' into some automaton which could skip > more than 1 character at a time in haystack and then executed / > interpreted this automaton. (Actually the algorithm was a bit more > tricky than this, I think: If the automaton sees a character that is > not in needle at all, it can immediately skip for the length of needle > and I think part of the trick was to start comparing from the end of > needle). > > So in this case a better algorithm is probably the way to go (and I > agree, asembler and REP, which would just be the naive algorithm with > 2 loops, won't cut it, at least I'd not bet on it) A big problem, however, is that while such "well-written" papers may be mathematically correct, they must make assumptions that aren't necessarily true in real life. For instance, analyses that I've read on searching algorithms show that this or that algorithm may minimize the number of times character comparisons are performed. But in real life, not all character comparisons have the same cost. If a processor has a string comparison instruction, for instance, then presumably the hardware has been optimized so that the characters will be compared significantly faster than if the program did all the index register manipulation itself and compared individual characters. Algorithmic analysis, or at least the kind I've seen, usually doesn't take this sort of thing into account. If one pattern-matching algorithm can be shown to be better than another, then (for a particular processor) there should be some values M and N such that if the source length is >M and the pattern length is >N, the mathematically faster algorithm actually does run faster---but that doesn't help if your real-life inputs are likely to have lengths significantly less than that. The original poster had a pattern with a quite small length, for instance. A technique in which you precompile the pattern into an automaton, for instance, is likely to make things slower when you're using such a small pattern and calling the Index routine thousands or millions of times. This doesn't necessarily help write the Index routine in the best way. In this case, an implementation *could* look at the lengths of the strings and make a judgment as to what the best algorithm might be (the implementor may have to experiment on each supported processor, though, to determine the correct cutoffs). But I think it would be a mistake for an implementor to write Index in a way that uses some "optimal" algorithm that makes things better only when the argument sizes are >1000, say, and makes things slower when the sizes are less than that. -- Adam ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-05 16:26 ` Adam Beneschan @ 2007-05-05 17:27 ` Markus E Leypold 2007-05-15 23:03 ` Randy Brukardt 0 siblings, 1 reply; 45+ messages in thread From: Markus E Leypold @ 2007-05-05 17:27 UTC (permalink / raw) Adam Beneschan <adam@irvine.com> writes: > On May 3, 5:28 pm, Markus E Leypold > <development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote: >> "Randy Brukardt" <r...@rrsoftware.com> writes: >> > "Adam Beneschan" <a...@irvine.com> wrote in message >> >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com... >> > ... >> >> It strikes me that Index is the kind of function that really ought to >> >> be written in assembly language, at least partially. I notice that >> >> the version of Linux that I'm using has a built-in function to search >> >> memory for a substring; this is very descriptively called memmem() and >> >> has the amusing profile >> >> >> void *memmem (const void *needle, size_t needlelen, >> >> const void *haystack, size_t haystacklen); >> >> >> according to the man page. But I assume this is written to use >> >> registers optimally and take advantage of the REP instructions (on an >> >> x86 or Pentium). I don't know how GNAT implements Index---I haven't >> >> looked into it. >> >> > The big expense in Index is the mapping set or function, not the actual >> > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index >> > was used to look for spam patterns), and finally special-cased a number of >> > common cases (no mapping, single character patterns, and so on). I also >> > spent a bit of time on the code generator, figuring that this sort of string >> > manipulation code is common enough that it might as well be generated well. >> > The updates helped a lot, although they don't quite generate a single >> > instruction such as is possible. (OTOH, Intel used to recommend avoiding the >> > block move and compare instructions because they fouled up the pipeline and >> > thus slowed the overall execution. I don't know if that is still true, but >> > ifi it is, there might be less benefit to hand-coded assembler than you are >> > thinking...) >> >> I'd like to add, that some years ago I read a rather well written >> paper about this kind of text searches. Just repeating comparisons in >> two loops (one to move the offset in haystack, the other to compare >> need against haystack+offset) is actually not the best method. I.e. if >> one fails when comparing this >> >> haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd >> needle: adasdqwsdfklsdf >> ^ >> comparison fails >> >> one can immediately increase the offset by 7, since the needle does >> not contain a '3' in the part that is before the point where the >> comparison failed. The technique I read about, used considerations of >> this kind to compile the 'needle' into some automaton which could skip >> more than 1 character at a time in haystack and then executed / >> interpreted this automaton. (Actually the algorithm was a bit more >> tricky than this, I think: If the automaton sees a character that is >> not in needle at all, it can immediately skip for the length of needle >> and I think part of the trick was to start comparing from the end of >> needle). >> >> So in this case a better algorithm is probably the way to go (and I >> agree, asembler and REP, which would just be the naive algorithm with >> 2 loops, won't cut it, at least I'd not bet on it) > > A big problem, however, is that while such "well-written" papers may > be mathematically correct, they must make assumptions that aren't > necessarily true in real life. For instance, analyses that I've read While that is certainly true, I don't think it applies in this case, The algorithm I'm referring to, and which I'm to lazy to look up, seems to be rather much in use when looking for a give substring in a larger string/text, like when searching or searching/replacing in a text editor. > on searching algorithms show that this or that algorithm may minimize > the number of times character comparisons are performed. But in real > life, not all character comparisons have the same cost. If a > processor has a string comparison instruction, for instance, then > presumably the hardware has been optimized so that the characters will > be compared significantly faster than if the program did all the index > register manipulation itself and compared individual characters. The processor hardware can only search of strings in a "dumb" way, whereas the algorithm I have been talking about can be much smarter. Of course I'd expect algorithms like that to have the best (and most predictable) performance on RISC machines. And it might well pay to implement the "interpreter" in assembly too. > Algorithmic analysis, or at least the kind I've seen, usually doesn't > take this sort of thing into account. If one pattern-matching Isn't that some kind of sweeping generalization? > algorithm can be shown to be better than another, then (for a > particular processor) there should be some values M and N such that if > the source length is >M and the pattern length is >N, the > mathematically faster algorithm actually does run faster---but that > doesn't help if your real-life inputs are likely to have lengths > significantly less than that. The original poster had a pattern with > a quite small length, for instance. A technique in which you > precompile the pattern into an automaton, for instance, is likely to > make things slower when you're using such a small pattern and calling > the Index routine thousands or millions of times. Well -- I'm not trying to peddle that algorithm to you or anybody else. Indeed I now wish. I've stayed silent instead. Originally I intended to point people, who try to find substrings to the fact that there a much smarter algorithms available which can skip a significant percentage of character comparisons. Wether this is faster than brute force dumb comparison would have to be decided in any single case. Ideally though, in the case of a substring of N characters, when a single character comparison fails, offset can be skipped by a K in the order of magnitude of N, that is, if you look for a substring of N characters, you might only have to do something on the order of 1/N of the comparisons you'd have to do in the dumb implementation. Admittedly things are more complicated than that, but I think one can see, that the point where it becomes as fast as the dumb case could be reached for small N, esp. if the interpreter is handwritten assembler or generated by a really good code generator. As I said: I'm not trying to peddle that algorithm to anybody, so I'm not inclined to look it up or try a better analysis. I've better things to do, since _I_ don't have problem with searching for substrings at the moment. > This doesn't necessarily help write the Index routine in the best > way. In this case, an implementation *could* look at the lengths of > the strings and make a judgment as to what the best algorithm might be > (the implementor may have to experiment on each supported processor, > though, to determine the correct cutoffs). But I think it would be a > mistake for an implementor to write Index in a way that uses some > "optimal" algorithm that makes things better only when the argument > sizes are >1000, say, and makes things slower when the sizes are less > than that. If I understood the "paper" right (actually it was only some dumb article in some applied programming magazine), the algorithm had been designed for simple word searched, i.e. substrings around 10 characters. Regards -- Markus ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-05 17:27 ` Markus E Leypold @ 2007-05-15 23:03 ` Randy Brukardt 0 siblings, 0 replies; 45+ messages in thread From: Randy Brukardt @ 2007-05-15 23:03 UTC (permalink / raw) (I've been on vacation and just am now catching up...) "Markus E Leypold" <development-2006-8ecbb5cc8aREMOVETHIS@ANDTHATm-e-leypold.de> wrote in message news:9qhcqr5gfi.fsf@hod.lan.m-e-leypold.de... > Adam Beneschan <adam@irvine.com> writes: > > > On May 3, 5:28 pm, Markus E Leypold > > <development-2006-8ecbb5cc8aREMOVET...@ANDTHATm-e-leypold.de> wrote: > >> "Randy Brukardt" <r...@rrsoftware.com> writes: > >> > "Adam Beneschan" <a...@irvine.com> wrote in message > >> >news:1178224048.034635.39010@l77g2000hsb.googlegroups.com... > >> > ... > >> >> It strikes me that Index is the kind of function that really ought to > >> >> be written in assembly language, at least partially. I notice that > >> >> the version of Linux that I'm using has a built-in function to search > >> >> memory for a substring; this is very descriptively called memmem() and > >> >> has the amusing profile > >> > >> >> void *memmem (const void *needle, size_t needlelen, > >> >> const void *haystack, size_t haystacklen); > >> > >> >> according to the man page. But I assume this is written to use > >> >> registers optimally and take advantage of the REP instructions (on an > >> >> x86 or Pentium). I don't know how GNAT implements Index---I haven't > >> >> looked into it. > >> > >> > The big expense in Index is the mapping set or function, not the actual > >> > compare. For Janus/Ada, I had seen a similar problem (a big deal as Index > >> > was used to look for spam patterns), and finally special-cased a number of > >> > common cases (no mapping, single character patterns, and so on). I also > >> > spent a bit of time on the code generator, figuring that this sort of string > >> > manipulation code is common enough that it might as well be generated well. > >> > The updates helped a lot, although they don't quite generate a single > >> > instruction such as is possible. (OTOH, Intel used to recommend avoiding the > >> > block move and compare instructions because they fouled up the pipeline and > >> > thus slowed the overall execution. I don't know if that is still true, but > >> > ifi it is, there might be less benefit to hand-coded assembler than you are > >> > thinking...) > >> > >> I'd like to add, that some years ago I read a rather well written > >> paper about this kind of text searches. Just repeating comparisons in > >> two loops (one to move the offset in haystack, the other to compare > >> need against haystack+offset) is actually not the best method. I.e. if > >> one fails when comparing this > >> > >> haystack: ... asjkajkaewudajksdkwueqeqweadasdqw3adkadkakd > >> needle: adasdqwsdfklsdf > >> ^ > >> comparison fails > >> > >> one can immediately increase the offset by 7, since the needle does > >> not contain a '3' in the part that is before the point where the > >> comparison failed. The technique I read about, used considerations of > >> this kind to compile the 'needle' into some automaton which could skip > >> more than 1 character at a time in haystack and then executed / > >> interpreted this automaton. (Actually the algorithm was a bit more > >> tricky than this, I think: If the automaton sees a character that is > >> not in needle at all, it can immediately skip for the length of needle > >> and I think part of the trick was to start comparing from the end of > >> needle). > >> > >> So in this case a better algorithm is probably the way to go (and I > >> agree, asembler and REP, which would just be the naive algorithm with > >> 2 loops, won't cut it, at least I'd not bet on it) > > > > A big problem, however, is that while such "well-written" papers may > > be mathematically correct, they must make assumptions that aren't > > necessarily true in real life. For instance, analyses that I've read > > While that is certainly true, I don't think it applies in this case, > The algorithm I'm referring to, and which I'm to lazy to look up, > seems to be rather much in use when looking for a give substring in a > larger string/text, like when searching or searching/replacing in a > text editor. No, Adam is right, it applies in every case. A brute force search is often cheaper than a fancier one. For instance, in the Janus/Ada parser, I implemented a fancy binary table search in order to find items faster. It actually made things slower. The problem was that many of the table rows only had a few items, so the setup time (which required a number of adds and shifts) took longer than a brute-force search (which was just a simple for loop, one add per iteration). Hand coding that loop in assembler turned it into a handful of instructions. Brute force is almost always going to be better if the comparisons are cheap and you can keep everything in registers. That's often true for Index. ... > Originally I intended to point people, who try to find substrings to > the fact that there a much smarter algorithms available which can skip > a significant percentage of character comparisons. Wether this is > faster than brute force dumb comparison would have to be decided in > any single case. Ideally though, in the case of a substring of N > characters, when a single character comparison fails, offset can be > skipped by a K in the order of magnitude of N, that is, if you look > for a substring of N characters, you might only have to do something > on the order of 1/N of the comparisons you'd have to do in the dumb > implementation. Admittedly things are more complicated than that, but > I think one can see, that the point where it becomes as fast as the > dumb case could be reached for small N, esp. if the interpreter is > handwritten assembler or generated by a really good code generator. Here the problem is that you have to know something about N in order to tell. My experience is that N is usually small (often 1), and in such cases there is likely no value to a fancy algorithm. Of course, in a general routine like Index, pretty much anything can happen, and it is highly unlikely that the algorithm will be optimal for every possible case. Truth is, this demonstrates a standard moral of programming: its fine to write your programs with a standard library routine, but you need to profile it after doing so to see where the bottlenecks are. And if you're going to call anything (but especially some canned library routine) millions of times, you need to be prepared to replace it with one that is optimal for your application. In this case, for instance, you can figure out the distribution of N for a particular program, and implement an appropriate algorithm. You can't expect us compiler writers to do it for you; we may have a very different distribution of N in mind. (Of course, if the routine isn't critical, don't sweat it -- doing otherwise is just premature optimization.) Randy. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-03 23:01 ` Randy Brukardt 2007-05-04 0:28 ` Markus E Leypold @ 2007-05-04 20:04 ` Adam Beneschan 2007-05-05 16:36 ` tmoran 1 sibling, 1 reply; 45+ messages in thread From: Adam Beneschan @ 2007-05-04 20:04 UTC (permalink / raw) On May 3, 4:01 pm, "Randy Brukardt" <r...@rrsoftware.com> wrote: > The big expense in Index is the mapping set or function, not the actual > compare. Yeah, but I'd assume that a high proportion of uses of Index would use the default parameter for mapping (Maps.Identity), which an implementation should be able to test for and choose a faster implementation in that case. Which is what I assumed you did. > For Janus/Ada, I had seen a similar problem (a big deal as Index > was used to look for spam patterns), and finally special-cased a number of > common cases (no mapping, single character patterns, and so on). I also > spent a bit of time on the code generator, figuring that this sort of string > manipulation code is common enough that it might as well be generated well. > The updates helped a lot, although they don't quite generate a single > instruction such as is possible. (OTOH, Intel used to recommend avoiding the > block move and compare instructions because they fouled up the pipeline and > thus slowed the overall execution. I don't know if that is still true, but > ifi it is, there might be less benefit to hand-coded assembler than you are > thinking...) Sigh... I've been in the computer business too long. I started out with processors that execute the instruction you tell it to execute and then go to the next one, so I still tend to think in those terms, even though I now have to deal with processors that look ahead to schedule instructions and try to execute instructions in parallel and even have "branch delays" so that if you tell the computer to branch, it will sit there and not budge until you give it five more instructions, which seems WAY too much like dealing with my 8-year-old son. But yes, these days I need to be more careful about my assumptions about what kinds of machine instructions will execute the fastest. -- Adam ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: Reading and writing a big file in Ada (GNAT) on Windows XP 2007-05-04 20:04 ` Adam Beneschan @ 2007-05-05 16:36 ` tmoran 0 siblings, 0 replies; 45+ messages in thread From: tmoran @ 2007-05-05 16:36 UTC (permalink / raw) >Yeah, but I'd assume that a high proportion of uses of Index would use >... > But yes, these days I need to be more careful about my assumptions about > what kinds of machine instructions will execute the fastest. Different versions of the same processor architecture (ie, compiler target) are different, different uses of Index have different statistics, with multicore it may be worth splitting up the job, ..... Sometimes it's worth rolling your own instead of using the library routine. ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2007-05-15 23:03 UTC | newest] Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-04-28 5:03 Reading and writing a big file in Ada (GNAT) on Windows XP Fionn Mac Cumhaill 2007-04-28 5:20 ` Gautier 2007-04-29 21:17 ` Fionn Mac Cumhaill 2007-04-28 5:25 ` tmoran 2007-04-28 6:56 ` Martin Krischik 2007-04-28 17:12 ` tmoran 2007-04-28 12:41 ` Jeffrey Creem 2007-04-29 21:35 ` Fionn Mac Cumhaill 2007-04-28 13:22 ` (see below) 2007-04-28 17:56 ` Simon Wright 2007-04-28 18:28 ` Jeffrey Creem 2007-04-29 7:20 ` Simon Wright 2007-04-29 21:44 ` Fionn Mac Cumhaill 2007-04-29 21:42 ` Fionn Mac Cumhaill 2007-04-30 0:48 ` Jeffrey R. Carter 2007-04-30 2:30 ` Fionn Mac Cumhaill 2007-04-30 4:21 ` tmoran 2007-04-28 19:12 ` Jeffrey R. Carter 2007-04-29 21:46 ` Fionn Mac Cumhaill 2007-05-01 14:10 ` Fionn Mac Cumhaill 2007-05-06 21:55 ` Quarc 2007-05-02 7:46 ` george 2007-05-03 6:31 ` Fionn Mac Cumhaill 2007-05-03 20:00 ` Simon Wright 2007-05-04 4:35 ` Jeffrey R. Carter 2007-05-04 4:45 ` Fionn Mac Cumhaill 2007-05-04 6:53 ` Alternative Index implementation? (Was: Reading and writing a big file in Ada (GNAT) on Windows XP) Jacob Sparre Andersen 2007-05-04 7:41 ` Dmitry A. Kazakov 2007-05-04 9:16 ` Copying string slices before calling subroutines? (Was: Alternative Index implementation?) Jacob Sparre Andersen 2007-05-04 9:44 ` Copying string slices before calling subroutines? Jacob Sparre Andersen 2007-05-04 10:14 ` Dmitry A. Kazakov 2007-05-04 12:07 ` Jeffrey Creem 2007-05-04 12:46 ` Dmitry A. Kazakov 2007-05-04 22:27 ` Simon Wright 2007-05-05 7:33 ` Jacob Sparre Andersen 2007-05-05 7:47 ` Dmitry A. Kazakov 2007-05-05 7:41 ` Dmitry A. Kazakov 2007-05-03 20:27 ` Reading and writing a big file in Ada (GNAT) on Windows XP Adam Beneschan 2007-05-03 23:01 ` Randy Brukardt 2007-05-04 0:28 ` Markus E Leypold 2007-05-05 16:26 ` Adam Beneschan 2007-05-05 17:27 ` Markus E Leypold 2007-05-15 23:03 ` Randy Brukardt 2007-05-04 20:04 ` Adam Beneschan 2007-05-05 16:36 ` tmoran
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox