From: "John R. Strohm"
Newsgroups: comp.lang.ada
Subject: Re: sorting large numbers of large records
Date: Mon, 28 Jul 2003 15:30:43 -0500

"Brien L. Christesen" wrote in message
news:bg3fgg$qig$1@grapevine.wam.umd.edu...
>
> First off, I'm an Ada newbie; I've only been using it for about a
> month now.
>
> I am trying to sort ~1,000,000 (with the potential to grow to 60
> million) records of 256 bytes each in Ada. The records are written to
> a binary file. Right now, I read in the file and divide it into two
> files: records greater than a pivot value and records less than it. I
> then call this division function recursively until it reaches a set of
> fewer than 100,000 records (if I make this number any larger, I get a
> stack overflow on my Windows box); at that point it creates an array
> of the records, calls quicksort on it, and appends the result to a
> final results file.
>
> I couldn't really find much discussion of doing external sorts in Ada
> (or much literature on external sorts at all), so I'm not sure if this
> is a good approach to be taking, or if there is a better way to do the
> sort. I tried creating an index, sorting the index, and then creating
> the results file based on the index. Jumping around the file with
> Direct_IO led to awful results, though. The execution time of the part
> of the code that read the index, got the corresponding value from the
> large record file, and wrote that record to a new file grew
> exponentially.
>
> Does anyone know of a good way to do this kind of sort? Thanks in
> advance for any responses.

The canonical reference on sorting is Knuth, Vol. 3, "Sorting and
Searching".

At 1 million records of 256 bytes each (about 256 MB), you can probably
do an in-core sort. At 60 million records (over 15 GB), you probably
can't. Plan on sorting chunks and then merging the sorted chunks; the
merge sketch at the end of this post shows the basic loop.

Making an index and sorting that is the second key to sort performance,
after choosing the best sorting algorithm you can.

Incidentally, depending on your system, it might be to your advantage
to maintain TWO data files: one big one that is sorted, and one little
one that isn't but contains the records that have been modified
recently. Whenever you go to modify a record, you first search the
little file to see whether it has already been modified. If it has, it
will be in the little file, and you just use that copy. If not, you
copy the record out of the big file, edit it, and put it in the little
file.
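
Here is a minimal sketch of that update path. Nothing below is from
the original poster's code: the record layout (a 4-byte integer key
plus 252 bytes of payload), the file names, the key value 42, and the
use of Ada.Direct_IO are all assumptions for illustration.

with Ada.Direct_IO;

procedure Two_File_Update is

   --  Assumed layout: an integer key plus 252 bytes of payload,
   --  roughly the 256-byte record the original question describes.
   type Payload is array (1 .. 252) of Character;

   type Record_Type is record
      Key  : Integer;
      Data : Payload;
   end record;

   package Rec_IO is new Ada.Direct_IO (Record_Type);
   use Rec_IO;

   Big    : Rec_IO.File_Type;  --  large, kept sorted by key
   Little : Rec_IO.File_Type;  --  small, unsorted, recent edits only

   --  Linear scan of the little file; it stays small, so this is cheap.
   procedure Find_In_Little (Key   : in Integer;
                             Item  : out Record_Type;
                             Found : out Boolean) is
   begin
      Reset (Little);
      Found := False;
      while not End_Of_File (Little) loop
         Read (Little, Item);
         if Item.Key = Key then
            Found := True;
            return;
         end if;
      end loop;
   end Find_In_Little;

   --  Binary search of the big file; because it is sorted, each lookup
   --  touches only a handful of records.
   procedure Find_In_Big (Key   : in Integer;
                          Item  : out Record_Type;
                          Found : out Boolean) is
      Lo  : Count := 1;
      Hi  : Count := Size (Big);
      Mid : Count;
   begin
      Found := False;
      while Lo <= Hi loop
         Mid := (Lo + Hi) / 2;
         Read (Big, Item, From => Mid);
         if Item.Key = Key then
            Found := True;
            return;
         elsif Item.Key < Key then
            Lo := Mid + 1;
         else
            Hi := Mid - 1;
         end if;
      end loop;
   end Find_In_Big;

   Rec   : Record_Type;
   Found : Boolean;

begin
   Open (Big,    Inout_File, "big.dat");
   Open (Little, Inout_File, "little.dat");

   --  To modify the record with key 42: prefer the little file's copy,
   --  falling back to the master copy in the big file.
   Find_In_Little (42, Rec, Found);
   if not Found then
      Find_In_Big (42, Rec, Found);
   end if;

   if Found then
      --  ... edit Rec here, then append it to the little file.  A real
      --  version would overwrite an existing little-file entry instead
      --  of appending a second copy.
      Write (Little, Rec, To => Size (Little) + 1);
   end if;

   Close (Big);
   Close (Little);
end Two_File_Update;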
When it becomes necessary to have the whole file sorted, you sort the little file and then merge the two, removing duplicate (old) records.
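
That final merge is an ordinary two-way merge of sorted sequences,
except that on a key collision the little file's record (the recent
edit) wins and the stale master record is dropped. A rough sketch,
under the same assumed record layout and invented file names as above:

with Ada.Direct_IO;

procedure Merge_Two_Files is

   type Payload is array (1 .. 252) of Character;

   type Record_Type is record
      Key  : Integer;
      Data : Payload;
   end record;

   package Rec_IO is new Ada.Direct_IO (Record_Type);
   use Rec_IO;

   Big, Little, Result : Rec_IO.File_Type;

   B, L : Record_Type := (Key => 0, Data => (others => ' '));
   Have_B, Have_L : Boolean;

   --  Fetch the next record from F; Item is left untouched and Have
   --  set to False once F is exhausted.
   procedure Advance (F    : in Rec_IO.File_Type;
                      Item : in out Record_Type;
                      Have : out Boolean) is
   begin
      Have := not End_Of_File (F);
      if Have then
         Read (F, Item);
      end if;
   end Advance;

begin
   Open (Big,    In_File, "big.dat");
   Open (Little, In_File, "little_sorted.dat");  --  sort it first
   Create (Result, Out_File, "merged.dat");

   Advance (Big, B, Have_B);
   Advance (Little, L, Have_L);

   while Have_B or Have_L loop
      if not Have_L or else (Have_B and then B.Key < L.Key) then
         Write (Result, B);            --  big side is smaller
         Advance (Big, B, Have_B);
      elsif Have_B and then B.Key = L.Key then
         Write (Result, L);            --  recent edit supersedes
         Advance (Big, B, Have_B);     --  drop the stale master record
         Advance (Little, L, Have_L);
      else
         Write (Result, L);            --  little side is smaller
         Advance (Little, L, Have_L);
      end if;
   end loop;

   Close (Big);
   Close (Little);
   Close (Result);
end Merge_Two_Files;

The same loop, minus the equal-key branch, is also the merge step for
the chunked external sort suggested above: sort each chunk that fits in
memory, write it out as a sorted run, then merge the runs (two or more
at a time) into the final result.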