From mboxrd@z Thu Jan 1 00:00:00 1970
From: ncohen@watson.ibm.com (Norman H. Cohen)
Subject: Re: GNAT function calling overhead
Date: 1995/04/06
Message-ID: <3m1kbo$174b@watnews1.watson.ibm.com>
References: <3m0nv1$pv2@nef.ens.fr>
Organization: IBM T.J. Watson Research Center
Reply-To: ncohen@watson.ibm.com
Newsgroups: comp.lang.ada
Distribution: world

In article <3m0nv1$pv2@nef.ens.fr>, sands@clipper.ens.fr (Duncan Sands)
writes:

|> system: DOS 486 (with math coprocessor), gcc 2.6.3, GNAT 2.03
|>
|> Essentially the question is: why so much function calling overhead
|> in GNAT?
|>
|> I'm writing a set of matrix factorization routines (Schur etc) so
|> of course I need routines for multiplying matrices etc.
|> For me a matrix is
|>
|>    type Matrix is array (Positive range <>, Positive range <>) of Float;
|>
|> I overloaded "*" for matrix multiplication:
|>
|>    function "*" (Left : in Matrix; Right : in Matrix) return Matrix;

Functions returning results of an unconstrained array type are
notoriously expensive.  Because the compiler cannot determine the size
of the result beforehand, it cannot leave space for it in an ordinary
stack frame, so the result must be put somewhere else and then copied
to its final destination after the function returns.  Unless your
compiler is clever enough to realize that your local variable inside
the function, say Result, is the only variable used in a return
statement, it will probably copy twice: from Result to "somewhere
else" and from "somewhere else" to the variable in the calling
subprogram to which you assign the function result.

Here are some other experiments you could try:

1. Use a procedure

      procedure Multiply (Left, Right : in Matrix; Product : out Matrix);

   instead of a function.  (This makes the caller responsible for
   knowing the size of the result and declaring an object with those
   dimensions to be passed as the third actual parameter; one possible
   body for Multiply is sketched after this list.)

2. (Poor SW engineering, but a worthwhile experiment:) Restrict your
   function to work with a CONSTRAINED array subtype:

      subtype Matrix_15 is Matrix (1 .. 15, 1 .. 15);

      function "*" (Left : in Matrix_15; Right : in Matrix_15)
         return Matrix_15;
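To make experiment 1 concrete, one plausible body for Multiply would be
something like the following (just a sketch; the local accumulator and
the assumptions about bounds are one reasonable choice, not anything
GNAT requires):

      procedure Multiply (Left, Right : in Matrix; Product : out Matrix) is
         --  Assumes Product'Range(1) = Left'Range(1),
         --  Product'Range(2) = Right'Range(2), and
         --  Left'Range(2) = Right'Range(1).
         Sum : Float;
      begin
         for I in Left'Range (1) loop
            for J in Right'Range (2) loop
               Sum := 0.0;
               for K in Left'Range (2) loop
                  Sum := Sum + Left (I, K) * Right (K, J);
               end loop;
               Product (I, J) := Sum;
            end loop;
         end loop;
      end Multiply;

Because the caller declares Product, no value of an unconstrained type
is ever returned, and the extra copies described above simply never
happen.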
|> Multiplying two 15 by 15 matrices 10_000 times using this function
|> takes about 55 seconds on my machine.  The algorithm is the obvious
|> one: loop over rows and columns, add up the appropriate products and
|> assign them.
|>
|> I then "inlined" this function: rather than using "*", I put the code
|> for "*" directly into my 10_000 times loop, of course renaming Left
|> and Right to the names of my matrices, and assigning directly to the
|> matrix which is to hold the answer.  In this way I eliminated the
|> function calling overhead.  Using this method, multiplying two 15 by
|> 15 matrices 10_000 times takes about 44 seconds.

If you wrote

   for I in Left'Range (1) loop
      for J in Right'Range (2) loop
         for K in Left'Range (2) loop
            Ultimate_Target (I, J) :=
               Ultimate_Target (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

rather than

   for I in Left'Range (1) loop
      for J in Right'Range (2) loop
         for K in Left'Range (2) loop
            Result (I, J) := Result (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

   Ultimate_Target := Result;

(as seems sensible), then you eliminated more than the function-call
overhead: you also eliminated the overhead that was originally
associated with the ":=" in

   Ultimate_Target := Left * Right;

|> All this was done with optimisation (-O3) and -gnatp (i.e. no range
|> checking etc).
|>
|> In summary: 55 seconds with function calling overhead.
|>             44 seconds without function calling overhead.
|>
|> Now, a 15 by 15 matrix means 225 entries.  225 entries at, say,
|> 8 bytes an entry makes a bit less than 2k.  So, depending on
|> whether GNAT takes function parameters by reference or by copy,
|> this makes anything between 2k and, say, 10k bytes to be copied
|> on each function call.
|>
|> Does this explain the time difference?  It shouldn't!  The amount
|> of time spent copying memory should be completely overwhelmed by
|> the amount of time taken to do the floating point operations!

As modern processors have become faster and faster, loads and stores
have become the bottleneck in many computations.  I don't know the
details of timings on the 486, but on the PowerPC architecture, once
you fill your floating-point pipeline with multiply-adds the way the
inner loop above does, you get one righthand-side expression completely
evaluated on each cycle, PROVIDED that you can get your operands into
floating-point registers fast enough.  ("Floating-point registers?" ask
the Intel users.  "What are floating-point registers?")  Accounting for
leading-edge and trailing-edge effects, the fifteen iterations of the
inner loop could take on the order of 20-25 cycles.  In contrast, a
single load of a value not in cache could cost you on the order of 10
cycles.  (Once you pay that penalty, an entire cache line is read in,
which, assuming row-major order, buys you something as you traverse a
row of Left, but not as you traverse a column of Right.)

If you get unlucky, you find that parts of different matrices you are
touching at about the same time (say row I of Result and row I of
Left), or different parts of the same matrix that you are touching at
about the same time (say Right(K,J) and Right(K+1,J)), are mapped to
the same cache line.  Then, unless you have a highly associative
cache, you encounter an inordinate number of cache misses and slow
your computation down dramatically.

Memory latencies play such an important role in the running time of
numerical algorithms that professional matrix multipliers almost never
use the familiar "for i/for j/for k" loops that I wrote above.  We are
more likely to see something like

   for K in Left'Range (2) loop
      for I in Left'Range (1) loop
         for J in Right'Range (2) loop
            Result (I, J) := Result (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

or, better yet, if it is convenient to keep the matrices that will be
used as left operands in transposed form,

   Result (I, J) :=
      Result (I, J) + Transpose_Of_Left (K, I) * Right (K, J);

which (again assuming row-major ordering of array components) does a
much better job of reusing the contents of cache lines once they have
been loaded from memory and the contents of registers once they have
been loaded from cache.
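For concreteness, a full version of that transposed variant might look
something like this (only a sketch; the procedure name, the parameter
profile, and the zeroing of Result are illustrative choices, with K
outermost so that Transpose_Of_Left(K,I) can stay in a register across
the entire inner loop):

   procedure Multiply_Transposed
     (Transpose_Of_Left, Right : in Matrix; Result : out Matrix) is
      --  Assumes the caller keeps the left operand in transposed
      --  form, i.e. Transpose_Of_Left (K, I) holds what Left (I, K)
      --  would hold.
   begin
      Result := (others => (others => 0.0));
      for K in Transpose_Of_Left'Range (1) loop
         for I in Transpose_Of_Left'Range (2) loop
            for J in Right'Range (2) loop
               --  Transpose_Of_Left (K, I) is fixed for the whole
               --  inner loop; Right and Result are both traversed
               --  along rows, reusing cache lines already loaded.
               Result (I, J) :=
                  Result (I, J) + Transpose_Of_Left (K, I) * Right (K, J);
            end loop;
         end loop;
      end loop;
   end Multiply_Transposed;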
It is because these effects are so powerful that Fortran preprocessors
performing these kinds of transformations are able to increase certain
SPECfp benchmark scores by orders of magnitude.

|> That is, for each of the 225 entries there are 15 floating point
|> multiplications to be performed.  The amount of time taken to
|> copy the 225 entries, even if you need to copy them several times,
|> should be MUCH smaller than the amount of time spent in the
|> calculation.  But the timings above indicate that function
|> calling overhead makes up something like 25% of the time taken!

Well, 20% (11/55), but in any event I'm not surprised.  Adding and
multiplying floating-point numbers is the easy part.  It's copying
them that can slow you down.

--
Norman H. Cohen                          ncohen@watson.ibm.com