Efficient io of arbitrary binary data.

comp.lang.ada
 help / color / mirror / Atom feed

* Efficient io of arbitrary binary data.
@ 1996-09-13  0:00 Brian R. Hanson
  1996-09-14  0:00 ` Larry Kilgallen
  1996-09-14  0:00 ` Larry Kilgallen
  0 siblings, 2 replies; 10+ messages in thread
From: Brian R. Hanson @ 1996-09-13  0:00 UTC (permalink / raw)



I recently had to replace a merge/sort program written in terible
fortran with a new implimentation written in c.

The data being sorted is variable length binary records.  The 
sort reads as many records as fit into some large buffer sorts them
and writes the sorted data to a file in large blocks.  THe blocks
are generated so that no record spans a block and the size of the block
is chosen to be efficiently read and written by the os.

Once the initial sort pass is complete, the file now has some number
of sorted regions (built of these blocks which it merges in log2(n)
passes.

Using c I was able to read the blocks (asynchronously) and merge
the data from the input buffers directly to the output buffer.
The buffer management routines returned references directly the the
strings in the buffers which could be compared and the appropriate
one moved.

I considering how this program could be written in Ada (part of 
an attempt to become Ada literate in an Ada hostile environment)
I a puzzled.  The approaches which Ada seems to allow all require 
much more copying of data as I am not allowed to return a reference
to a slice of an array I can only return the slice itself.

In c, the approach to building the block is to treat the block as
an array of char and an array of int.  The record data is written
from the begining of the block and the record lengths are written 
from the end of the block.  Records are stored until a record long
enough to cause the length and data to overlap is encountered at
which time the block is written and the record is stored in the 
new block instead.  (having the length information and data grow 
toward the middle allows both to be used naturally without worrying
about alignment).  When the blocks are being read, the routine to
get the next record keeps returning the location and size of the 
next record in the block until the block is exhausted when it starts
on the next.

Writing the block is not hard in Ada with a little help from 
unchecked_conversion.  However, reading the records seems to require
that the data be returned from the block reader rather than a 
reference to the data.

Is this true or there a much better approach to solving this problem
of efficient io of arbitrary binary records?

-- Brian Hanson
-- brh@cray.com




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-13  0:00 Efficient io of arbitrary binary data Brian R. Hanson
@ 1996-09-14  0:00 ` Larry Kilgallen
  1996-09-16  0:00   ` Brian Hanson
  1996-09-14  0:00 ` Larry Kilgallen
  1 sibling, 1 reply; 10+ messages in thread
From: Larry Kilgallen @ 1996-09-14  0:00 UTC (permalink / raw)

In article <3239B3B2.1AE4@cray.com>, "Brian R. Hanson" <brh@cray.com> writes:

> I a puzzled.  The approaches which Ada seems to allow all require 
> much more copying of data as I am not allowed to return a reference
> to a slice of an array I can only return the slice itself.

Presuming the concern is runtime efficiency rather than code inspection
efficiency :-), whether you return a slice or a reference to a slice
may be immaterial. A compiler with aggressive optimization may generate
inline code for your whole "read" function, making the object code
for the two methods equivalent.  When programming in Ada you have
given the compiler more information than programming in C, so it has
more opportunity to optimize.  Based on your return address, I would
presume folks in your shop are big on optimizing compilers.

Larry Kilgallen

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-14  0:00 ` Larry Kilgallen
@ 1996-09-16  0:00   ` Brian Hanson
  1996-09-16  0:00     ` Stephen Leake
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Brian Hanson @ 1996-09-16  0:00 UTC (permalink / raw)

Larry Kilgallen) writes:

   In article <3239B3B2.1AE4@cray.com>, "Brian R. Hanson" <brh@cray.com> writes:

   > I a puzzled.  The approaches which Ada seems to allow all require 
   > much more copying of data as I am not allowed to return a reference
   > to a slice of an array I can only return the slice itself.

   Presuming the concern is runtime efficiency rather than code inspection
   efficiency :-), whether you return a slice or a reference to a slice
   may be immaterial. A compiler with aggressive optimization may generate
   inline code for your whole "read" function, making the object code
   for the two methods equivalent.  When programming in Ada you have
   given the compiler more information than programming in C, so it has
   more opportunity to optimize.  Based on your return address, I would
   presume folks in your shop are big on optimizing compilers.

Actually, folks in our shop are big on optimizing c, c++ and fortran compilers.
My target platforms - sgi (our new parent) and sun (our workstation of 
choice prior to acquiring a parent) - use gnat.

is it really likely that 

	case compare_keys(current_string(buf1), current_string(buf2)) is
	  when smaller, the_same =>
		store_string(buf3, current_string(buf1));
		advance_string(buf1);
	  when larger =>
		store_string(buf3, current_string(buf2));
		advance_string(buf2);
	end case;

would really optimize the copying of the string slices away.

   package buffers is
     type buffer_type is private;
     procedure advance_string(buf: in out buffer_type);
     function current_string(buf: in buffer_type) return string;
     procedure store_string(buf: in out buffer_type, data: in string);
   private 
     type buffer_type is record 
       data: string(65536);
       start: positive;
       length: positive;
       -- other details omitted.
     end record;
   end buffers;

   package body buffers is
     -- details omitted.

     function current_string(buf: in buffer_type) return string is
     begin
	return buf.data(buf.start, buf.length);
     end;
   end buffers;

Brian Hanson
--
-- Brian Hanson
-- brh@cray.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-16  0:00   ` Brian Hanson
@ 1996-09-16  0:00     ` Stephen Leake
  1996-09-16  0:00     ` Larry Kilgallen
  1996-09-16  0:00     ` Robert A Duff
  2 siblings, 0 replies; 10+ messages in thread
From: Stephen Leake @ 1996-09-16  0:00 UTC (permalink / raw)



Brian Hanson wrote:
> [discussion of optimizers snipped] 
> is it really likely that
> 
>         case compare_keys(current_string(buf1), current_string(buf2)) is
>           when smaller, the_same =>
>                 store_string(buf3, current_string(buf1));
>                 advance_string(buf1);
>           when larger =>
>                 store_string(buf3, current_string(buf2));
>                 advance_string(buf2);
>         end case;
> 
> 
>    package buffers is
>      type buffer_type is private;
>      procedure advance_string(buf: in out buffer_type);
>      function current_string(buf: in buffer_type) return string;
>      procedure store_string(buf: in out buffer_type, data: in string);
>    private
>      type buffer_type is record
>        data: string(65536);
>        start: positive;
>        length: positive;
>        -- other details omitted.
>      end record;
>    end buffers;
> 
>    package body buffers is
>      -- details omitted.
> 
>      function current_string(buf: in buffer_type) return string is
>      begin
>         return buf.data(buf.start, buf.length);
>      end;
>    end buffers;
> 

Since you are trying to keep only one copy of each string, but need to 
pass access values for it, I would suggest you define your own string 
type:

type String_Access is access String;
type Fast_String is record
    Buffer : String_Access;
    First : NATURAL;
    Length : NATURAL; -- or maybe last?
end record;

then define all your functions with this type:

function current_string(buf: in buffer_type) return Fast_String is
begin
    return (buf.data'access, (buf.start, buf.length);
end;

procedure store_string (To : in buffer_type; from : in fast_string) is
begin
    To.data (1 .. from.length) := 
        from.buffer (from.first .. from.first + from.length - 1);
end;

This way, the string is never copied until you need to move it.

warning: I have NOT run this code thru a compiler, so typos are 
probably present.

> Brian Hanson
> --
> -- Brian Hanson
> -- brh@cray.com

-- 
- Stephe




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-16  0:00   ` Brian Hanson
  1996-09-16  0:00     ` Stephen Leake
@ 1996-09-16  0:00     ` Larry Kilgallen
  1996-09-16  0:00     ` Robert A Duff
  2 siblings, 0 replies; 10+ messages in thread
From: Larry Kilgallen @ 1996-09-16  0:00 UTC (permalink / raw)



In article <BRH.96Sep16002020@poplar111.cray.com>, brh@poplar111.cray.com (Brian Hanson) writes:

> Actually, folks in our shop are big on optimizing c, c++ and fortran compilers.
> My target platforms - sgi (our new parent) and sun (our workstation of 
> choice prior to acquiring a parent) - use gnat.
> 
> is it really likely that 
> 
> 
> 	case compare_keys(current_string(buf1), current_string(buf2)) is
> 	  when smaller, the_same =>
> 		store_string(buf3, current_string(buf1));
> 		advance_string(buf1);
> 	  when larger =>
> 		store_string(buf3, current_string(buf2));
> 		advance_string(buf2);
> 	end case;
> 
> would really optimize the copying of the string slices away.

As has been pointed out on this list, one of the advantages of using
GNAT is that if you feel it is inadequate in some regard, you can
extend it.  Optimization seems one of the best areas for extension,
since you are not doing something which will conflict with whatever
the next standard might be.

Larry Kilgallen




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-16  0:00   ` Brian Hanson
  1996-09-16  0:00     ` Stephen Leake
  1996-09-16  0:00     ` Larry Kilgallen
@ 1996-09-16  0:00     ` Robert A Duff
  2 siblings, 0 replies; 10+ messages in thread
From: Robert A Duff @ 1996-09-16  0:00 UTC (permalink / raw)



In article <BRH.96Sep16002020@poplar111.cray.com>,
Brian Hanson <brh@poplar111.cray.com> wrote:
[stuff about sorting slices of strings]

I suggest you define a private type to represent your string-slice
pointers.  Each string pointer is represented as a low and high bound,
or low bound and length, or whatever, relative to the buffer.  You can
even use an access type pointing to the first character, plus a length,
if you make the buffer components aliased.

Then write your sorting routine in terms of this private data type.

Make sure the buffer is immutable (that is, don't allow modifications
while string-slice-pointers are in use).

- Bob




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-13  0:00 Efficient io of arbitrary binary data Brian R. Hanson
  1996-09-14  0:00 ` Larry Kilgallen
@ 1996-09-14  0:00 ` Larry Kilgallen
  1996-09-14  0:00   ` Robert Dewar
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Larry Kilgallen @ 1996-09-14  0:00 UTC (permalink / raw)

In article <3239B3B2.1AE4@cray.com>, "Brian R. Hanson" <brh@cray.com> writes:
> I recently had to replace a merge/sort program written in terible
> fortran with a new implimentation written in c.
> 
> The data being sorted is variable length binary records.  The 
> sort reads as many records as fit into some large buffer sorts them
> and writes the sorted data to a file in large blocks.  THe blocks
> are generated so that no record spans a block and the size of the block
> is chosen to be efficiently read and written by the os.

Given that many constraints, you are bound to do some individualized
fiddling, which is the basic operational mode for C.  Such activities
are possible in Ada, but not using the higher level constructs which
make the language popular with its adherents.

> I considering how this program could be written in Ada (part of 
> an attempt to become Ada literate in an Ada hostile environment)
> I a puzzled.  The approaches which Ada seems to allow all require 
> much more copying of data as I am not allowed to return a reference
> to a slice of an array I can only return the slice itself.

Given that your input is performed in some os-specific fashion anyway
(your approach seems to indicate you would not want extra os buffering),
you could return an access to the input buffer (array) and indices for the
start and ending bytes of a particular record.  Copy the data when you
are ready to ship the record out.  Given a high-quality optimizing
compiler, a tight loop to move the bytes should not be any less efficient
than having it done by a run-time-library.

Larry Kilgallen

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-14  0:00 ` Larry Kilgallen
@ 1996-09-14  0:00   ` Robert Dewar
  1996-09-16  0:00   ` Brian Hanson
  1996-09-17  0:00   ` Ted Dennison
  2 siblings, 0 replies; 10+ messages in thread
From: Robert Dewar @ 1996-09-14  0:00 UTC (permalink / raw)



Brian said

"> I considering how this program could be written in Ada (part of
> an attempt to become Ada literate in an Ada hostile environment)
> I a puzzled.  The approaches which Ada seems to allow all require
> much more copying of data as I am not allowed to return a reference
> to a slice of an array I can only return the slice itself."

Well seeing as you have neither slices nor references to slices in C,
it is hardly possible that this limitation is a significant one! DOn't
assume that because you are writing in Ada, you *have* to use all its
features!





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-14  0:00 ` Larry Kilgallen
  1996-09-14  0:00   ` Robert Dewar
@ 1996-09-16  0:00   ` Brian Hanson
  1996-09-17  0:00   ` Ted Dennison
  2 siblings, 0 replies; 10+ messages in thread
From: Brian Hanson @ 1996-09-16  0:00 UTC (permalink / raw)

   Brian said

   "> I considering how this program could be written in Ada (part of
   > an attempt to become Ada literate in an Ada hostile environment)
   > I a puzzled.  The approaches which Ada seems to allow all require
   > much more copying of data as I am not allowed to return a reference
   > to a slice of an array I can only return the slice itself."

Robert Dewar writes:
   Well seeing as you have neither slices nor references to slices in C,
   it is hardly possible that this limitation is a significant one! DOn't
   assume that because you are writing in Ada, you *have* to use all its
   features!

C may not have slices but since a pointer to a char can be considered an
a array, I have the equivalent to a slice by carrying around the length
seperately.  I can use memcmp and memcpy to compare and move this block
of data.  I would expect that most compilers turn these library calls 
into inline code.

I had hopes of packaging up the buffer management code so that the sort
code just dealt with string references.  A language that I used for many
years (cybil - a descendent of pascal used as a system language on the
Control Data 180 NOS/ve systems) provided two usefull abstractions - the
heap and sequence.  The heap was used to provide a capability vaguely 
similar to Ada's storage pools.  Sequences provided a way to access a
block of memory in a way that did not violate type checking.  References 
to objects could be "next"ed from the heap stack like.  In, cybil, 
writing this sort program would be very straight forward with the full
assistance of the compiler.  It is my experience with cybil that 
attracts me to Ada and makes writing c painful.

So the question is not whether I can translate the code line for line
from c to Ada generating a c program in Ada but what are the Ada 
idioms for dealing with this sort of program?

Brian Hanson
--
-- Brian Hanson
-- brh@cray.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Efficient io of arbitrary binary data.
  1996-09-14  0:00 ` Larry Kilgallen
  1996-09-14  0:00   ` Robert Dewar
  1996-09-16  0:00   ` Brian Hanson
@ 1996-09-17  0:00   ` Ted Dennison
  2 siblings, 0 replies; 10+ messages in thread
From: Ted Dennison @ 1996-09-17  0:00 UTC (permalink / raw)



Brian Hanson wrote:
> 
>    Brian said
> 
>    "> I considering how this program could be written in Ada (part of
>    > an attempt to become Ada literate in an Ada hostile environment)
>    > I a puzzled.  The approaches which Ada seems to allow all require
>    > much more copying of data as I am not allowed to return a reference
>    > to a slice of an array I can only return the slice itself."
> 
> Robert Dewar writes:
>    Well seeing as you have neither slices nor references to slices in C,
>    it is hardly possible that this limitation is a significant one! DOn't
>    assume that because you are writing in Ada, you *have* to use all its
>    features!
> 
> C may not have slices but since a pointer to a char can be considered an
> a array, I have the equivalent to a slice by carrying around the length
> seperately.  I can use memcmp and memcpy to compare and move this block
> of data.  I would expect that most compilers turn these library calls
> into inline code.

The Ada equivalent to THAT would be passing around access types, and using
UNCHECKED_CONVERSION to hose the source object types into other access types
(assuming a typical implementation of access types - check your appendix F). 
This is, of course, to be avioded.

A slightly better method is to declare a "buffer" type which is an array
of byte (or integer, whatever's convienent)-sized objects. Eventually you 
will probably copy data out of your reciept buffer for client use. At THAT
point, do the UNCHECKED_CONVERSION. It won't be any extra code, if you are
already copying anyway.

-- 
T.E.D.          
                |  Work - mailto:dennison@escmail.orl.mmc.com  |
                |  Home - mailto:dennison@iag.net              |
                |  URL  - http://www.iag.net/~dennison         |




^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~1996-09-17  0:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1996-09-13  0:00 Efficient io of arbitrary binary data Brian R. Hanson
1996-09-14  0:00 ` Larry Kilgallen
1996-09-16  0:00   ` Brian Hanson
1996-09-16  0:00     ` Stephen Leake
1996-09-16  0:00     ` Larry Kilgallen
1996-09-16  0:00     ` Robert A Duff
1996-09-14  0:00 ` Larry Kilgallen
1996-09-14  0:00   ` Robert Dewar
1996-09-16  0:00   ` Brian Hanson
1996-09-17  0:00   ` Ted Dennison

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox