comp.lang.ada
From: Keith Thompson <kst@cts.com>
Subject: Re: sorting large numbers of large records
Date: Wed, 30 Jul 2003 00:32:16 GMT
Message-ID: <yecispkluok.fsf@king.cts.com>
In-Reply-To: bg5rol$rnp$1@grapevine.wam.umd.edu

"Brien L. Christesen" <blchrist@rac1.wam.umd.edu> writes:
> That is a good point, and I looked into the unix sort command.  The only
> problem is that as far as I can tell, that only sorts text files.  I have
> a file of binary records, so I have no idea how I could use a system sort
> command to do it.  Is there any way that would work?

Translate your binary records into a sortable text format, one record
per line, sort the resulting text file, and translate back into your
binary format.
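
A minimal sketch of that round trip, in Python for brevity.  The record
layout (a 32-bit unsigned key, a 64-bit float, an 8-byte payload) and
the names RECORD, to_line and from_line are assumptions made for
illustration, and the in-memory sort() merely stands in for the
external sort step.

```python
import struct

# Hypothetical record layout (an assumption, not from the original post):
# unsigned 32-bit key, 64-bit float, 8-byte payload, big-endian, no padding.
RECORD = struct.Struct(">Id8s")


def to_line(raw: bytes) -> str:
    """Encode one binary record as a text line with the sort key up front."""
    key, _value, _payload = RECORD.unpack(raw)
    # Fixed-width hex key, then the whole record in hex so it can be recovered.
    return f"{key:08x} {raw.hex()}"


def from_line(line: str) -> bytes:
    """Recover the binary record from a line produced by to_line."""
    _key, hex_record = line.split(" ", 1)
    return bytes.fromhex(hex_record)


records = [
    RECORD.pack(300, 1.5, b"third   "),
    RECORD.pack(7, -2.25, b"first   "),
    RECORD.pack(42, 0.0, b"second  "),
]

lines = [to_line(r) for r in records]
lines.sort()  # stand-in for running the text file through the system sort
sorted_records = [from_line(line) for line in lines]
keys = [RECORD.unpack(r)[0] for r in sorted_records]
```

Each line carries the key twice (once as the leading sort field, once
inside the hex dump), which wastes a little space but keeps the decoder
trivial.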

As long as a line doesn't contain NUL or linefeed characters, you
should be OK.  If you're using GNU sort, the length of each line can
be unlimited; for a non-GNU Unix sort program, there may be some
limit, but it's probably at least 1024 characters.  If you don't have
GNU sort, consider installing it; it's part of the GNU coreutils
package at <ftp://ftp.gnu.org/gnu/coreutils/>.

Put the sort key in a fixed-width field at the beginning of the line,
in a form that can be sorted by a simple string comparison (Unix sort
can do numeric comparisons, but they're slower).  You can probably
make the other fields fixed-width or variable-width, whichever is more
convenient.
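
To see why the key field should be fixed-width, here is a small Python
sketch comparing string order for unpadded and zero-padded keys.  The
width of 10 digits and the name padded_key are assumptions; pick
whatever width fits your largest key.

```python
KEY_WIDTH = 10  # hypothetical width; must be large enough for the biggest key


def padded_key(n: int) -> str:
    """Format a non-negative integer key as a fixed-width, zero-padded field."""
    if n < 0 or len(str(n)) > KEY_WIDTH:
        raise ValueError(f"key {n} does not fit in {KEY_WIDTH} digits")
    return f"{n:0{KEY_WIDTH}d}"


values = [9, 10, 100, 2]

# Plain string comparison gets unpadded numbers wrong: "10" sorts before "2".
unpadded = sorted(str(v) for v in values)

# With zero padding, string order and numeric order agree.
padded = sorted(padded_key(v) for v in values)
```

Here unpadded comes out as 10, 100, 2, 9, while padded comes out in
numeric order.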

Don't worry too much about making the text format human-readable.
Consider using hexadecimal rather than decimal for integer fields; it
sorts just as well (when treated as strings) and may be cheaper to
convert.  You can even use raw hexadecimal for floating-point and
other binary fields (convert the representation, not the value), as
long as you're not using them as part of the sort key; the only
requirement is that you're able to recover the binary values from the
strings.
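
A rough Python sketch of both conversions, assuming 32-bit unsigned
integers and 64-bit IEEE floats (the helper names are made up for the
example).  It also shows why the raw float representation is fine for
storage but not as a sort key: a negative value has its sign bit set,
so its hex string sorts after the positive ones.

```python
import struct


def int_to_hex(n: int) -> str:
    """Fixed-width hex for an unsigned 32-bit integer; sorts correctly as a string."""
    if not 0 <= n < 2**32:
        raise ValueError("expected an unsigned 32-bit value")
    return f"{n:08x}"


def float_to_hex(x: float) -> str:
    """Hex dump of the 8-byte big-endian representation (not of the value)."""
    return struct.pack(">d", x).hex()


def hex_to_float(s: str) -> float:
    """Recover the exact float from a string produced by float_to_hex."""
    return struct.unpack(">d", bytes.fromhex(s))[0]


ints = [255, 16, 4096, 1]
hex_sorted = sorted(int_to_hex(n) for n in ints)

# Round trip is exact because only the representation is converted.
restored = hex_to_float(float_to_hex(0.1))

# Raw float hex does not sort numerically: -1.0 lands after 1.0.
float_order = sorted([float_to_hex(1.0), float_to_hex(-1.0)])
```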

Carefully read the man page for the sort command on your system.  If
your program needs to be portable to multiple Unix-like systems, read
the man page on all of them.  Pay attention to any limitations on line
length.  Use whatever options are needed to turn off any
locale-specific behavior; you need raw bytewise ASCII collation, not
something that knows about accented letters.  For GNU sort, set the
environment variable $LC_ALL to "C".
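
If you drive sort from a program rather than from the shell, the
locale has to be set in the child's environment.  A sketch in Python,
assuming a sort program is on the PATH; the function names are
hypothetical.

```python
import os
import shutil
import subprocess


def c_locale_env() -> dict:
    """Copy of the current environment with LC_ALL forced to "C"."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"
    return env


def sort_bytewise(data: bytes) -> bytes:
    """Pipe newline-terminated lines through the system sort in the C locale."""
    sort_path = shutil.which("sort")
    if sort_path is None:
        raise RuntimeError("no sort program found on PATH")
    result = subprocess.run(
        [sort_path],
        input=data,
        stdout=subprocess.PIPE,
        env=c_locale_env(),
        check=True,
    )
    return result.stdout
```

In the C locale, sort_bytewise(b"b\nB\na\n") should put "B" first,
since uppercase letters precede lowercase ones in ASCII; many other
locales would interleave the cases instead.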

Finally, I'm not sure exactly what GNU sort (or any other Unix-like
sort) does with input too big to fit into memory; read the
documentation and/or experiment to make sure it fits your needs.  I
know that GNU sort has an option (-T) to tell it where to put
temporary files, so evidently it spills to temporary files on disk
when it needs to.
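
For what it's worth, here is a sketch of invoking sort on a file with
the temporary directory chosen explicitly, again in Python.  It assumes
a sort that accepts -T and -o (GNU sort does; check your man page), and
the function names and paths are placeholders.

```python
import os
import shutil
import subprocess


def sort_command(input_path: str, output_path: str, tmp_dir: str) -> list:
    """Build the argument list: -T picks the temp directory, -o the output file."""
    return ["sort", "-T", tmp_dir, "-o", output_path, input_path]


def sort_file(input_path: str, output_path: str, tmp_dir: str) -> None:
    """Sort a text file on disk in the C locale, with temp files under tmp_dir."""
    if shutil.which("sort") is None:
        raise RuntimeError("no sort program found on PATH")
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(sort_command(input_path, output_path, tmp_dir), env=env, check=True)
```

Point tmp_dir at a filesystem with enough free space for the whole
input; the default temporary directory is often too small for a
multi-gigabyte file.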

-- 
Keith Thompson (The_Other_Keith) kst@cts.com  <http://www.ghoti.net/~kst>
San Diego Supercomputer Center           <*>  <http://www.sdsc.edu/~kst>
Schroedinger does Shakespeare: "To be *and* not to be"