sorting large numbers of large records

comp.lang.ada
 help / color / mirror / Atom feed

* sorting large numbers of large records
@ 2003-07-28 15:29 Brien L. Christesen
  2003-07-28 15:35 ` Vinzent Hoefler
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Brien L. Christesen @ 2003-07-28 15:29 UTC (permalink / raw)




First off, I'm an Ada newbie, I've only been using it for about a month
now.

I am trying to sort ~1,000,000 (with potential to grow to 60 mil)
records of 256 bytes each in Ada. The records are written to a binary
file. Right now, I read in the file, and divide it into files greater
than and less than a pivot value.  I then recursively call this division
function until it reaches a set less than 100,000 records (if I make this
number any larger I get a stack overflow on my windows box) it creates an
array of the records, calls quicksort on it, and appends the result to a
final results file.

I couldn't really find much discussion of doing external sorts in Ada (or
much literature on external sorts at all) so I'm not sure if this is a
good approach to be taking, or if there is a better way to do the sort.  I
tried creating an index, sorting the index, and then creating the results
file based on the index.  Jumping around the file with direct_io led to
awful results, though.  The execution time of the part of the code that
read the index, got the corresponding value from the large record file,
and wrote that record to a new file grew exponentially.

Does anyone know of a good way to do this kind of sort?  Thanks in advance
for any responses.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
@ 2003-07-28 15:35 ` Vinzent Hoefler
  2003-07-31 15:22   ` Brien L. Christesen
  2003-07-28 16:25 ` Hyman Rosen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Vinzent Hoefler @ 2003-07-28 15:35 UTC (permalink / raw)


Brien L. Christesen wrote:

>I couldn't really find much discussion of doing external sorts in Ada (or
>much literature on external sorts at all)

<URL:http://opal.cs.binghamton.edu/~zzhang/ex_sort/intro.html>.
<URL:http://www.cs.ust.hk/faculty/dimitris/COMP530/notes/l7b.pdf>?


Vinzent.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
  2003-07-28 15:35 ` Vinzent Hoefler
@ 2003-07-28 16:25 ` Hyman Rosen
  2003-07-28 20:30 ` John R. Strohm
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Hyman Rosen @ 2003-07-28 16:25 UTC (permalink / raw)


You should read Knuth's Sorting and Searching.
The classic way to do this is to read as many records
from the file as you can fit in memory, sort them, and
write them back out to another file. Keep doing this
until you have sorted all of your original records.
Now merge all the sorted subfiles together. You do that
by reading a record from each subfile, writing out the
smallest, and reading a new record from the file which
provided the record just written.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
  2003-07-28 15:35 ` Vinzent Hoefler
  2003-07-28 16:25 ` Hyman Rosen
@ 2003-07-28 20:30 ` John R. Strohm
  2003-07-28 20:52   ` Hyman Rosen
  2003-07-28 23:33 ` Matthew Heaney
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: John R. Strohm @ 2003-07-28 20:30 UTC (permalink / raw)


"Brien L. Christesen" <blchrist@rac2.wam.umd.edu> wrote in message
news:bg3fgg$qig$1@grapevine.wam.umd.edu...
>
>
> First off, I'm an Ada newbie, I've only been using it for about a month
> now.
>
> I am trying to sort ~1,000,000 (with potential to grow to 60 mil)
> records of 256 bytes each in Ada. The records are written to a binary
> file. Right now, I read in the file, and divide it into files greater
> than and less than a pivot value.  I then recursively call this division
> function until it reaches a set less than 100,000 records (if I make this
> number any larger I get a stack overflow on my windows box) it creates an
> array of the records, calls quicksort on it, and appends the result to a
> final results file.
>
> I couldn't really find much discussion of doing external sorts in Ada (or
> much literature on external sorts at all) so I'm not sure if this is a
> good approach to be taking, or if there is a better way to do the sort.  I
> tried creating an index, sorting the index, and then creating the results
> file based on the index.  Jumping around the file with direct_io led to
> awful results, though.  The execution time of the part of the code that
> read the index, got the corresponding value from the large record file,
> and wrote that record to a new file grew exponentially.
>
> Does anyone know of a good way to do this kind of sort?  Thanks in advance
> for any responses.

The canonical reference on sorting is Knuth, Vol. 3, "Sorting and
Searching".

At 1 million records of 256 bytes each, you can probably do an in-core sort.
At 60 million records, you probably can't.  Plan on sorting chunks and then
merging the sorted chunks.

Making an index and sorting that is the second key to sort performance,
after choosing the best sorting algorithm you can.

Incidentally, depending on your system, it might be to your advantage to
maintain TWO datafiles: one big one that is sorted, and one little one that
isn't, but contains things that have been modified recently.  Whenever you
go to modify a record, you first search the little file, to see if it has
already been modified.  If it has, it will be in the little file, and you
just use that one.  If not, you copy it out of the big file, edit it, and
put it in the little file.  When it becomes necessary to have the whole file
sorted, you sort the little file and then merge the two, removing duplicate
(old) records.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 20:30 ` John R. Strohm
@ 2003-07-28 20:52   ` Hyman Rosen
  2003-07-28 23:47     ` Matthew Heaney
  0 siblings, 1 reply; 18+ messages in thread
From: Hyman Rosen @ 2003-07-28 20:52 UTC (permalink / raw)


John R. Strohm wrote:
> Incidentally, depending on your system, it might be to your advantage to
> maintain TWO datafiles: one big one that is sorted, and one little one that
> isn't, but contains things that have been modified recently.

If he reaches this point, the OP might just want to
move to a real database, where problems like this
havel already been dealt with.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
                   ` (2 preceding siblings ...)
  2003-07-28 20:30 ` John R. Strohm
@ 2003-07-28 23:33 ` Matthew Heaney
  2003-07-28 23:43 ` Matthew Heaney
  2003-07-29  0:42 ` John Cupak
  5 siblings, 0 replies; 18+ messages in thread
From: Matthew Heaney @ 2003-07-28 23:33 UTC (permalink / raw)



"Brien L. Christesen" <blchrist@rac2.wam.umd.edu> wrote in message
news:bg3fgg$qig$1@grapevine.wam.umd.edu...
>
> Does anyone know of a good way to do this kind of sort?  Thanks in advance
> for any responses.

The list containers in the Charles library have a (quick)sort operation.
You could read the records into a singly-linked list container, and then
sort that.  Lists allow the sort to be done without copying elements, by
exchanging pointers.

Alternatively, you could use a vector or array, and then use the generic
sort algorithm (also part of the library) to sort that.  A vector is more
space efficient, but a sort requires copying of elements.

Another option is to read the records into a set, which automatically sorts
elements as they are inserted into the container.  (Sets and maps are
implemented using a balanced tree.)

A set is probably overkill, though, because you don't need do lookups or
anything.  If all you want to do is sort, then try using a list.

The Charles library is available from my website.

http://home.earthlink.net/~matthewjheaney/charles/

Look for a new release of Charles in the next few days.  I am currently
synchronizing the interfaces of the hashed vs. sorted associative
containers.

Send me email if you have any specific library questions.

-Matt





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
                   ` (3 preceding siblings ...)
  2003-07-28 23:33 ` Matthew Heaney
@ 2003-07-28 23:43 ` Matthew Heaney
  2003-07-29  0:42 ` John Cupak
  5 siblings, 0 replies; 18+ messages in thread
From: Matthew Heaney @ 2003-07-28 23:43 UTC (permalink / raw)



"Brien L. Christesen" <blchrist@rac2.wam.umd.edu> wrote in message
news:bg3fgg$qig$1@grapevine.wam.umd.edu...
>
>
> I couldn't really find much discussion of doing external sorts in Ada (or
> much literature on external sorts at all) so I'm not sure if this is a
> good approach to be taking, or if there is a better way to do the sort.  I
> tried creating an index, sorting the index, and then creating the results
> file based on the index.  Jumping around the file with direct_io led to
> awful results, though.  The execution time of the part of the code that
> read the index, got the corresponding value from the large record file,
> and wrote that record to a new file grew exponentially.
>
> Does anyone know of a good way to do this kind of sort?  Thanks in advance
> for any responses.

I forgot to mention that I also wrote both indexed i/o and indexed
sequential i/o containers.  These might be useful for minimizing how many
elements are kept in main memory.  (The indexed i/o files are implemented
using a B-tree.)

http://www.adapower.com/reuse/indexed_io.html


-Matt





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 20:52   ` Hyman Rosen
@ 2003-07-28 23:47     ` Matthew Heaney
  0 siblings, 0 replies; 18+ messages in thread
From: Matthew Heaney @ 2003-07-28 23:47 UTC (permalink / raw)



"Hyman Rosen" <hyrosen@mail.com> wrote in message
news:1059425531.850814@master.nyc.kbcfp.com...
> John R. Strohm wrote:
> > Incidentally, depending on your system, it might be to your advantage to
> > maintain TWO datafiles: one big one that is sorted, and one little one
that
> > isn't, but contains things that have been modified recently.
>
> If he reaches this point, the OP might just want to
> move to a real database, where problems like this
> havel already been dealt with.
>

The indexed i/o components I wrote will probably meet his needs.  I'm not
sure he really needs a full-blown database.

http://www.adapower.com/reuse/indexed_io.html

See also the Charles library for internal sorting methods:

http://home.earthlink.net/~matthewjheaney/charles

-Matt





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:29 Brien L. Christesen
                   ` (4 preceding siblings ...)
  2003-07-28 23:43 ` Matthew Heaney
@ 2003-07-29  0:42 ` John Cupak
  2003-07-29  3:38   ` Matthew Heaney
  2003-07-29  8:32   ` Preben Randhol
  5 siblings, 2 replies; 18+ messages in thread
From: John Cupak @ 2003-07-29  0:42 UTC (permalink / raw)


As much as I just LOVE Ada, this reminds me of my favorite interview
questions
a few years back - which might just solve Brien's problems.

"Well, young man," I used to ask, "how would you sort 20 records?"

The interviewee would typically respond "I'd use the bubble sort."

I'd listen to his answer, and then ask "and how would you sort 1000
records?"

They'd stumble around on that one for a bit, until they remembered something
about the quick sort, and propose to use that algorithm.

Then I'd spring the final question. "How would you sort 1,000,000 records?"

They'd scratch their heads and usually offer the quick sort again.

But, if someone answered "I'd call the system sort routine" for EVERY
question,
 then I knew I was interviewing a COBOL programmer. They NEVER wrote a sort!

So, I'd suggest looking into the system "sort" routine on whatever computer
you're using. I sure hope "Mr. Bill" (Gates, that is) thought of one.

Not EVERYTHING needs to be in Ada! (gasp!)

John Cupak

"Brien L. Christesen" <blchrist@rac2.wam.umd.edu> wrote in message
news:bg3fgg$qig$1@grapevine.wam.umd.edu...
>
>
> First off, I'm an Ada newbie, I've only been using it for about a month
> now.
>
> I am trying to sort ~1,000,000 (with potential to grow to 60 mil)
> records of 256 bytes each in Ada. The records are written to a binary
> file. Right now, I read in the file, and divide it into files greater
> than and less than a pivot value.  I then recursively call this division
> function until it reaches a set less than 100,000 records (if I make this
> number any larger I get a stack overflow on my windows box) it creates an
> array of the records, calls quicksort on it, and appends the result to a
> final results file.
>
> I couldn't really find much discussion of doing external sorts in Ada (or
> much literature on external sorts at all) so I'm not sure if this is a
> good approach to be taking, or if there is a better way to do the sort.  I
> tried creating an index, sorting the index, and then creating the results
> file based on the index.  Jumping around the file with direct_io led to
> awful results, though.  The execution time of the part of the code that
> read the index, got the corresponding value from the large record file,
> and wrote that record to a new file grew exponentially.
>
> Does anyone know of a good way to do this kind of sort?  Thanks in advance
> for any responses.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-29  0:42 ` John Cupak
@ 2003-07-29  3:38   ` Matthew Heaney
  2003-07-29  8:32   ` Preben Randhol
  1 sibling, 0 replies; 18+ messages in thread
From: Matthew Heaney @ 2003-07-29  3:38 UTC (permalink / raw)



"John Cupak" <jcupak744@comcast.net> wrote in message
news:4qjVa.3814$It4.1937@rwcrnsc51.ops.asp.att.net...
>
> So, I'd suggest looking into the system "sort" routine on whatever
computer
> you're using. I sure hope "Mr. Bill" (Gates, that is) thought of one.
>
> Not EVERYTHING needs to be in Ada! (gasp!)

Eventually, when I get around to implementing the full suite of generic
algorithms, the Charles library should meet most sorting needs.  However,
the library is useable as is right now, and you can indeed sort every kind
of container in the library.

http://home.earthlink.net/~matthewjheaney/charles/

I have been busy tweaking the containers.  There will be a new release
either this week or early next week.  I will then be devoting the month of
August (2003) to writing a proposal in response to AI-302, for a standard
container library.






^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-29  0:42 ` John Cupak
  2003-07-29  3:38   ` Matthew Heaney
@ 2003-07-29  8:32   ` Preben Randhol
  1 sibling, 0 replies; 18+ messages in thread
From: Preben Randhol @ 2003-07-29  8:32 UTC (permalink / raw)


John Cupak wrote:
> So, I'd suggest looking into the system "sort" routine on whatever computer
> you're using. I sure hope "Mr. Bill" (Gates, that is) thought of one.

I would rather he sort his bugs.

-- 
Ada95 is good for you.
http://www.crystalcode.com/codemage/MainMenu/Coding/Ada/IntroducingAda.php



^ permalink raw reply	[flat|nested] 18+ messages in thread

* sorting large numbers of large records
@ 2003-07-29 13:10 Brien L. Christesen
  2003-07-29 14:30 ` Larry Kilgallen
  2003-07-30  0:32 ` Keith Thompson
  0 siblings, 2 replies; 18+ messages in thread
From: Brien L. Christesen @ 2003-07-29 13:10 UTC (permalink / raw)


That is a good point, and I looked into the unix sort command.  The only
problem is that as far as I can tell, that only sorts text files.  I have
a file of binary records, so I have no idea how I could use a system sort
command to do it.  Is there any way that would work?

thanks for the feedback!
Brien




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-29 13:10 sorting large numbers of large records Brien L. Christesen
@ 2003-07-29 14:30 ` Larry Kilgallen
  2003-07-30  0:32 ` Keith Thompson
  1 sibling, 0 replies; 18+ messages in thread
From: Larry Kilgallen @ 2003-07-29 14:30 UTC (permalink / raw)

In article <bg5rol$rnp$1@grapevine.wam.umd.edu>, "Brien L. Christesen" <blchrist@rac1.wam.umd.edu> writes:

> That is a good point, and I looked into the unix sort command.  The only
> problem is that as far as I can tell, that only sorts text files.  I have
> a file of binary records, so I have no idea how I could use a system sort
> command to do it.  Is there any way that would work?

Even the complex Sort Specification format on VMS does not provide for
sorting based on non-ascii fields.  To use the built-in VMS sort facility
on non-ascii fields, you have to use the Callable Sort entrypoints, giving
your own User Compare routine.

Perhaps the same thing is true on one or more flavors of Unix.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-29 13:10 sorting large numbers of large records Brien L. Christesen
  2003-07-29 14:30 ` Larry Kilgallen
@ 2003-07-30  0:32 ` Keith Thompson
  2003-07-30  1:53   ` Hyman Rosen
  1 sibling, 1 reply; 18+ messages in thread
From: Keith Thompson @ 2003-07-30  0:32 UTC (permalink / raw)

"Brien L. Christesen" <blchrist@rac1.wam.umd.edu> writes:
> That is a good point, and I looked into the unix sort command.  The only
> problem is that as far as I can tell, that only sorts text files.  I have
> a file of binary records, so I have no idea how I could use a system sort
> command to do it.  Is there any way that would work?

Translate your binary records into a sortable text format, one record
per line, sort the resulting text file, and translate back into your
binary format.

As long as a line doesn't contain NUL or linefeed characters, you
should be ok.  If you're using GNU sort, the length of each line can
be unlimited; for a non-GNU Unix sort program, there may be some
limit, but it's probably at least 1024 characters.  If you don't have
GNU sort, consider installing it; it's part of the GNU coreutils
package at <ftp://ftp.gnu.org/gnu/coreutils/>.

Put the sort key in a fixed-width field at the beginning of the line,
in a form that can be sorted by a simple string comparison (Unix sort
can do numeric comparisons, but they're slower).  You can probably
make the other fields fixed-width or variable-width, whichever is more
convenient.

Don't worry too much about making the text format human-readable.
Consider using hexadecimal rather than decimal for integer fields; it
sorts just as well (when treated as strings) and may be cheaper to
convert.  You can even use raw hexadecimal for floating-point and
other binary fields (convert the representation, not the value), as
long as you're not using them as part of the sort key; the only
requirement is that you're able to recover the binary values from the
strings.        

Carefully read the man page for the sort command on your system.  If
your program needs to be portable to multiple Unix-like systems, read
the man page on all of them.  Pay attention to any limitations on line
length.  Use whatever options are needed to turn off any
locale-specific behavior; you need raw bytewise ASCII collation, not
something that knows about accented letters.  For GNU sort, set the
environment variable $LC_ALL to "C".

Finally, I'm not sure what GNU sort (or any other Unix-like sort) does
with input too big to fit into memory; read the documentation and/or
experiment to make sure it fits your needs.  I know that GNU sort has
an option to tell it where to put temporary files, so apparently it
uses temporary files somehow.

-- 
Keith Thompson (The_Other_Keith) kst@cts.com  <http://www.ghoti.net/~kst>
San Diego Supercomputer Center           <*>  <http://www.sdsc.edu/~kst>
Schroedinger does Shakespeare: "To be *and* not to be"

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-30  0:32 ` Keith Thompson
@ 2003-07-30  1:53   ` Hyman Rosen
  2003-07-30 14:55     ` Matthew Heaney
  0 siblings, 1 reply; 18+ messages in thread
From: Hyman Rosen @ 2003-07-30  1:53 UTC (permalink / raw)


Keith Thompson wrote:
 > If you're using GNU sort, the length of each line can be unlimited;

So there :-)

> Finally, I'm not sure what GNU sort (or any other Unix-like sort) does
> with input too big to fit into memory;

As far as I know, UNIX sort has always been able to sort arbitrarily large
files by doing what I suggested in an earlier message - sort pieces of the
original into temporary files, then merge them together.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-30  1:53   ` Hyman Rosen
@ 2003-07-30 14:55     ` Matthew Heaney
  2003-07-30 16:41       ` Chad R. Meiners
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Heaney @ 2003-07-30 14:55 UTC (permalink / raw)


Hyman Rosen <hyrosen@mail.com> wrote in message news:<jyFVa.715$7h6.251@nwrdny03.gnilink.net>...
> Keith Thompson wrote:
>  > If you're using GNU sort, the length of each line can be unlimited;
> 
> So there :-)
> 
> > Finally, I'm not sure what GNU sort (or any other Unix-like sort) does
> > with input too big to fit into memory;
> 
> As far as I know, UNIX sort has always been able to sort arbitrarily large
> files by doing what I suggested in an earlier message - sort pieces of the
> original into temporary files, then merge them together.

Actually, since these are fixed size records, you could do a sort by
simply exchanging record values on disk.  You have random access via
Direct_IO.

You could also map the entire file, to view it as one large array.

In either case you could use a generic algorithm (the one in Charles
would be adequate) to do the work.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-30 14:55     ` Matthew Heaney
@ 2003-07-30 16:41       ` Chad R. Meiners
  0 siblings, 0 replies; 18+ messages in thread
From: Chad R. Meiners @ 2003-07-30 16:41 UTC (permalink / raw)



"Matthew Heaney" <mheaney@on2.com> wrote in message
news:1ec946d1.0307300655.2ad33d81@posting.google.com...
> You could also map the entire file, to view it as one large array.

Doing so would be very slow compared to the algorithms Vinzent Hoefler
referenced.





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: sorting large numbers of large records
  2003-07-28 15:35 ` Vinzent Hoefler
@ 2003-07-31 15:22   ` Brien L. Christesen
  0 siblings, 0 replies; 18+ messages in thread
From: Brien L. Christesen @ 2003-07-31 15:22 UTC (permalink / raw)

Thanks for the pointers to the information.  I ended up using something
like the knuth approach, with some buffering of IO built in.  Now 1M
records of 256 bytes are sorting in as little as 2.5 minutes, and the
execution time is growing almost linearly.  Thanks for all of the help
everyone!
Brien

Vinzent Hoefler <ada.rocks@jlfencey.com> wrote:
> Brien L. Christesen wrote:

>>I couldn't really find much discussion of doing external sorts in Ada (or
>>much literature on external sorts at all)

> <URL:http://opal.cs.binghamton.edu/~zzhang/ex_sort/intro.html>.
> <URL:http://www.cs.ust.hk/faculty/dimitris/COMP530/notes/l7b.pdf>?

> Vinzent.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2003-07-31 15:22 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-07-29 13:10 sorting large numbers of large records Brien L. Christesen
2003-07-29 14:30 ` Larry Kilgallen
2003-07-30  0:32 ` Keith Thompson
2003-07-30  1:53   ` Hyman Rosen
2003-07-30 14:55     ` Matthew Heaney
2003-07-30 16:41       ` Chad R. Meiners
  -- strict thread matches above, loose matches on Subject: below --
2003-07-28 15:29 Brien L. Christesen
2003-07-28 15:35 ` Vinzent Hoefler
2003-07-31 15:22   ` Brien L. Christesen
2003-07-28 16:25 ` Hyman Rosen
2003-07-28 20:30 ` John R. Strohm
2003-07-28 20:52   ` Hyman Rosen
2003-07-28 23:47     ` Matthew Heaney
2003-07-28 23:33 ` Matthew Heaney
2003-07-28 23:43 ` Matthew Heaney
2003-07-29  0:42 ` John Cupak
2003-07-29  3:38   ` Matthew Heaney
2003-07-29  8:32   ` Preben Randhol

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox