From mboxrd@z Thu Jan 1 00:00:00 1970
From: ncohen@watson.ibm.com (Norman H. Cohen)
Subject: Re: GNAT function calling overhead
Date: 1995/04/06
Message-ID: <3m1kbo$174b@watnews1.watson.ibm.com>
References: <3m0nv1$pv2@nef.ens.fr>
Organization: IBM T.J. Watson Research Center
Reply-To: ncohen@watson.ibm.com
Newsgroups: comp.lang.ada
Distribution: world

In article <3m0nv1$pv2@nef.ens.fr>, sands@clipper.ens.fr (Duncan Sands)
writes:

|> system: DOS 486 (with math coprocessor), gcc 2.6.3, GNAT 2.03
|>
|> Essentially the question is: why so much function calling overhead
|> in GNAT?
|>
|> I'm writing a set of matrix factorization routines (Schur etc) so
|> of course I need routines for multiplying matrices etc.
|> For me a matrix is
|>
|>    type Matrix is array (Positive range <>, Positive range <>) of Float;
|>
|> I overloaded "*" for matrix multiplication:
|>
|>    function "*" (Left : in Matrix; Right : in Matrix) return Matrix;

Functions returning results of an unconstrained array type are
notoriously expensive.  Because the compiler cannot determine the size
of the result beforehand, it cannot leave space for it in an ordinary
stack frame, so the result must be put somewhere else and then copied
to its final destination after the function returns.  Unless your
compiler is clever enough to realize that your local variable inside
the function, say Result, is the only variable used in a return
statement, it will probably copy twice: from Result to "somewhere
else" and from "somewhere else" to the variable in the calling
subprogram to which you assign the function result.

Here are some other experiments you could try:

1. Use a procedure

      procedure Multiply (Left, Right : in Matrix; Product : out Matrix);

   instead of a function.  (This makes the caller responsible for
   knowing the size of the result and declaring an object with those
   dimensions to be passed as the third actual parameter; one possible
   body for Multiply is sketched after this list.)

2. (Poor SW engineering, but a worthwhile experiment:) Restrict your
   function to work with a CONSTRAINED array subtype:

      subtype Matrix_15 is Matrix (1 .. 15, 1 .. 15);

      function "*" (Left : in Matrix_15; Right : in Matrix_15)
         return Matrix_15;
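To make experiment 1 concrete, one plausible body for Multiply would be
something like the following (just a sketch; the local accumulator and
the assumptions about bounds are one reasonable choice, not anything
GNAT requires):

      procedure Multiply (Left, Right : in Matrix; Product : out Matrix) is
         --  Assumes Product'Range(1) = Left'Range(1),
         --  Product'Range(2) = Right'Range(2), and
         --  Left'Range(2) = Right'Range(1).
         Sum : Float;
      begin
         for I in Left'Range (1) loop
            for J in Right'Range (2) loop
               Sum := 0.0;
               for K in Left'Range (2) loop
                  Sum := Sum + Left (I, K) * Right (K, J);
               end loop;
               Product (I, J) := Sum;
            end loop;
         end loop;
      end Multiply;

Because the caller declares Product, no value of an unconstrained type
is ever returned, and the extra copies described above simply never
happen.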
|> Multiplying two 15 by 15 matrices 10_000 times using this function
|> takes about 55 seconds on my machine.  The algorithm is the obvious
|> one: loop over rows and columns, add up the appropriate products and
|> assign them.
|>
|> I then "inlined" this function: rather than using "*", I put the code
|> for "*" directly into my 10_000 times loop, of course renaming Left
|> and Right to the names of my matrices, and assigning directly to the
|> matrix which is to hold the answer.  In this way I eliminated the
|> function calling overhead.  Using this method, multiplying two 15 by
|> 15 matrices 10_000 times takes about 44 seconds.

If you wrote

   for I in Left'Range (1) loop
      for J in Right'Range (2) loop
         for K in Left'Range (2) loop
            Ultimate_Target (I, J) :=
               Ultimate_Target (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

rather than

   for I in Left'Range (1) loop
      for J in Right'Range (2) loop
         for K in Left'Range (2) loop
            Result (I, J) := Result (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

   Ultimate_Target := Result;

(as seems sensible), then you eliminated more than the function-call
overhead: you also eliminated the overhead that was originally
associated with the ":=" in

   Ultimate_Target := Left * Right;

|> All this was done with optimisation (-O3) and -gnatp (i.e. no range
|> checking etc).
|>
|> In summary: 55 seconds with function calling overhead.
|>             44 seconds without function calling overhead.
|>
|> Now, a 15 by 15 matrix means 225 entries.  225 entries at, say,
|> 8 bytes an entry makes a bit less than 2k.  So, depending on
|> whether GNAT takes function parameters by reference or by copy,
|> this makes anything between 2k and, say, 10k bytes to be copied
|> on each function call.
|>
|> Does this explain the time difference?  It shouldn't!  The amount
|> of time spent copying memory should be completely overwhelmed by
|> the amount of time taken to do the floating point operations!

As modern processors have become faster and faster, loads and stores
have become the bottleneck in many computations.  I don't know the
details of timings on the 486, but on the PowerPC architecture, once
you fill your floating-point pipeline with multiply-adds the way the
inner loop above does, you get one righthand-side expression completely
evaluated on each cycle, PROVIDED that you can get your operands into
floating-point registers fast enough.  ("Floating-point registers?" ask
the Intel users.  "What are floating-point registers?")  Accounting for
leading-edge and trailing-edge effects, the fifteen iterations of the
inner loop could take on the order of 20-25 cycles.  In contrast, a
single load of a value not in cache could cost you on the order of 10
cycles.  (Once you pay that penalty, an entire cache line is read in,
which, assuming row-major order, buys you something as you traverse a
row of Left, but not as you traverse a column of Right.)

If you get unlucky, you find that parts of different matrices you are
touching at about the same time (say row I of Result and row I of
Left), or different parts of the same matrix that you are touching at
about the same time (say Right(K,J) and Right(K+1,J)), are mapped to
the same cache line.  Then, unless you have a highly associative
cache, you encounter an inordinate number of cache misses and slow
your computation down dramatically.

Memory latencies play such an important role in the running time of
numerical algorithms that professional matrix multipliers almost never
use the familiar "for i/for j/for k" loops that I wrote above.  We are
more likely to see something like

   for K in Left'Range (2) loop
      for I in Left'Range (1) loop
         for J in Right'Range (2) loop
            Result (I, J) := Result (I, J) + Left (I, K) * Right (K, J);
         end loop;
      end loop;
   end loop;

or, better yet, if it is convenient to keep the matrices that will be
used as left operands in transposed form,

   Result (I, J) :=
      Result (I, J) + Transpose_Of_Left (K, I) * Right (K, J);

which (again assuming row-major ordering of array components) does a
much better job of reusing the contents of cache lines once they have
been loaded from memory and the contents of registers once they have
been loaded from cache.
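For concreteness, a full version of that transposed variant might look
something like this (only a sketch; the procedure name, the parameter
profile, and the zeroing of Result are illustrative choices, with K
outermost so that Transpose_Of_Left(K,I) can stay in a register across
the entire inner loop):

   procedure Multiply_Transposed
     (Transpose_Of_Left, Right : in Matrix; Result : out Matrix) is
      --  Assumes the caller keeps the left operand in transposed
      --  form, i.e. Transpose_Of_Left (K, I) holds what Left (I, K)
      --  would hold.
   begin
      Result := (others => (others => 0.0));
      for K in Transpose_Of_Left'Range (1) loop
         for I in Transpose_Of_Left'Range (2) loop
            for J in Right'Range (2) loop
               --  Transpose_Of_Left (K, I) is fixed for the whole
               --  inner loop; Right and Result are both traversed
               --  along rows, reusing cache lines already loaded.
               Result (I, J) :=
                  Result (I, J) + Transpose_Of_Left (K, I) * Right (K, J);
            end loop;
         end loop;
      end loop;
   end Multiply_Transposed;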
It is because these effects are so powerful that Fortran preprocessors
performing these kinds of transformations are able to increase certain
SPECfp benchmark scores by orders of magnitude.

|> That is, for each of the 225 entries there are 15 floating point
|> multiplications to be performed.  The amount of time taken to
|> copy the 225 entries, even if you need to copy them several times,
|> should be MUCH smaller than the amount of time spent in the
|> calculation.  But the timings above indicate that function
|> calling overhead makes up something like 25% of the time taken!

Well, 20% (11/55), but in any event I'm not surprised.  Adding and
multiplying floating-point numbers is the easy part.  It's copying
them that can slow you down.

--
Norman H. Cohen                          ncohen@watson.ibm.com