From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,INVALID_MSGID
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,7402728c011ea87a,start
X-Google-Attributes: gid103376,public
From: "Brian R. Hanson" <brh@cray.com>
Subject: Efficient io of arbitrary binary data.
Date: 1996/09/13
Message-ID: <3239B3B2.1AE4@cray.com>#1/1
X-Deja-AN: 180430782
content-type: text/plain; charset=us-ascii
organization: Cray Research a division of Silicon Graphics, Inc.
mime-version: 1.0
newsgroups: comp.lang.ada
x-mailer: Mozilla 3.0b7 (X11; I; SunOS 5.4 sun4m)
Date: 1996-09-13T00:00:00+00:00
List-Id: <comp.lang.ada>


I recently had to replace a merge/sort program written in terrible
Fortran with a new implementation written in C.

The data being sorted is variable length binary records.  The
sort reads as many records as fit into some large buffer, sorts them,
and writes the sorted data to a file in large blocks.  The blocks
are generated so that no record spans a block, and the size of the
block is chosen to be efficiently read and written by the OS.

Once the initial sort pass is complete, the file has some number (n)
of sorted regions, each built of these blocks, which the program
merges in log2(n) passes.

Using C I was able to read the blocks (asynchronously) and merge
the data from the input buffers directly to the output buffer.
The buffer management routines returned references directly to the
strings in the buffers, which could be compared and the appropriate
one moved.
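
To make the merge step concrete, here is a minimal sketch of merging
two sorted runs by reference.  It is not the actual program: the
RecRef type, rec_cmp, and merge_runs are hypothetical names, the
ordering (bytewise, shorter record first on a tie) is an assumption,
and the asynchronous reads are left out.  The only copy is the one
from the input buffer to the output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference to a record that still lives in an input buffer. */
typedef struct {
    const char *data;
    size_t len;
} RecRef;

/* Assumed ordering: bytewise compare, shorter record first on a tie. */
static int rec_cmp(RecRef a, RecRef b)
{
    size_t n = a.len < b.len ? a.len : b.len;
    int c = memcmp(a.data, b.data, n);
    if (c != 0)
        return c;
    return (a.len > b.len) - (a.len < b.len);
}

/* Merge two sorted runs of references into out.  Each record is copied
 * exactly once, from its input buffer to the output buffer.
 * Returns the number of bytes written; out must be large enough. */
size_t merge_runs(const RecRef *a, size_t na,
                  const RecRef *b, size_t nb, char *out)
{
    size_t i = 0, j = 0, w = 0;

    assert(out != NULL);
    while (i < na || j < nb) {
        const RecRef *pick;
        if (j >= nb || (i < na && rec_cmp(a[i], b[j]) <= 0))
            pick = &a[i++];   /* run a has the smaller (or only) record */
        else
            pick = &b[j++];
        memcpy(out + w, pick->data, pick->len);
        w += pick->len;
    }
    return w;
}
```

Taking from run a on equal keys keeps the merge stable, which may or
may not matter for the real data.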

In considering how this program could be written in Ada (part of
an attempt to become Ada literate in an Ada hostile environment)
I am puzzled.  The approaches which Ada seems to allow all require
much more copying of data, as I am not allowed to return a reference
to a slice of an array; I can only return the slice itself.

In C, the approach to building the block is to treat the block as
both an array of char and an array of int.  The record data is written
from the beginning of the block and the record lengths are written
from the end of the block.  Records are stored until one arrives that
is long enough to make the lengths and the data overlap, at which
time the block is written and that record is stored in the new block
instead.  (Having the length information and data grow toward the
middle allows both to be used naturally without worrying about
alignment.)  When the blocks are being read, the routine to get the
next record keeps returning the location and size of the next record
in the block until the block is exhausted, at which point it starts
on the next block.
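
A rough sketch of that layout, under stated assumptions rather than
the real code: BLOCK_SIZE is an arbitrary value, all type and function
names are hypothetical, and the record count is assumed to live in the
last int of the block (something has to tell the reader when the block
is exhausted).  Lengths fill the int slots just below the count,
growing downward.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096                      /* assumed; tune to the OS */
#define NINTS (BLOCK_SIZE / sizeof(int))

/* The same storage seen as bytes and as ints; the union keeps the
 * int view properly aligned. */
typedef union {
    char bytes[BLOCK_SIZE];
    int  ints[NINTS];
} Block;

typedef struct { Block *blk; size_t used; size_t count; } Writer;
typedef struct { const Block *blk; size_t next; size_t offset; } Reader;

void block_begin(Writer *w, Block *blk)
{
    w->blk = blk;
    w->used = 0;
    w->count = 0;
    blk->ints[NINTS - 1] = 0;                /* assumed: count in last int */
}

/* Store one record.  Returns 1 on success, 0 if data and lengths
 * would overlap, in which case the caller writes the block out and
 * stores the record in a fresh block. */
int block_put(Writer *w, const char *rec, int len)
{
    assert(len >= 0);
    if (w->count + 2 > NINTS)
        return 0;                            /* no length slot left */
    size_t slot = NINTS - 2 - w->count;      /* lengths grow downward */
    if (w->used + (size_t)len > slot * sizeof(int))
        return 0;                            /* data would reach the lengths */
    memcpy(w->blk->bytes + w->used, rec, (size_t)len);
    w->blk->ints[slot] = len;
    w->used += (size_t)len;
    w->count++;
    w->blk->ints[NINTS - 1] = (int)w->count;
    return 1;
}

void reader_begin(Reader *r, const Block *blk)
{
    r->blk = blk;
    r->next = 0;
    r->offset = 0;
}

/* Return a pointer into the block (no copy) and set *len, or return
 * NULL when the block is exhausted. */
const char *block_next(Reader *r, int *len)
{
    size_t count = (size_t)r->blk->ints[NINTS - 1];
    if (r->next >= count)
        return NULL;
    *len = r->blk->ints[NINTS - 2 - r->next];
    const char *rec = r->blk->bytes + r->offset;
    r->offset += (size_t)*len;
    r->next++;
    return rec;
}
```

The point of block_next is that it hands back an address inside the
block, which is exactly the kind of reference the rest of this post
is asking about.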

Writing the block is not hard in Ada with a little help from
Unchecked_Conversion.  However, reading the records seems to require
that the data itself be returned from the block reader rather than a
reference to the data.

Is this true, or is there a much better approach to solving this
problem of efficient io of arbitrary binary records?

-- Brian Hanson
-- brh@cray.com