From: jp3@rl.ac.uk (Dr J Parker)
Subject: Ada-95 for numerics?
Date: 1996/04/05
Message-ID: <4k42bc$91m@newton.cc.rl.ac.uk>
Organization: Rutherford Appleton Laboratory, Oxon, UK
Newsgroups: comp.lang.ada

Joerg Rodemann writes:

"Since I will have to do a lot of matrix diagonalization in the near
future (matrix dimensions will be about 1000x1000) I checked the speed
different languages produce. I used the routine tred2 from "Numerical
Recipes in C", which brings a real symmetric matrix to tri-diagonal
form. I tested the following languages/compilers:

  1.) C:          compilers: HP-UX cc, gcc-2.7.2; Solaris gcc-2.7.2
  2.) Fortran-77: compiler:  HP-UX f77
                  (the routine was called from an Ada-95 main)
  3.) Ada-95:     compiler:  GNAT 3.01/gcc-2.7.2

a.) Has anybody out there used/tested Ada for similar problems? What
were the results?"

Sure, I spent about 2 weeks working with GNAT 2.02 on an HP 735, trying
to optimize some scalar and some linear algebra programs. I'll tell you
what I learned, but remember it's limited to a specific machine, an old
GNAT, and 2 weeks of tinkering.

Summary of everything I'm going to say: know the difference between a
machine-portable compiler like gcc/GNAT and a native Fortran compiler;
don't use unconstrained arrays; in multidimensional arrays, vary the
outer indices fastest; unroll critical linear algebra loops by hand;
after that, use the -funroll-all-loops flag; write critical inner loops
in lots of different ways and try various combinations of compiler
flags; if you are doing linear algebra, then no matter what language or
compiler you are using, consider linking to the BLAS (basic linear
algebra subroutines); if you don't want to use the BLAS, write your own
BLAS-like package - here is where you definitely suppress checks; the
scalar benchmark presented below (an FFT) has gcc/GNAT running at 80%
to 110% of HP Fortran speed; gcc/GNAT peaked at 64 megaflops, and HP's
native Fortran at 78 megaflops, on a machine with a (100 x 100) Linpack
rating of about 50 megaflops. (You can stop reading here :)

Regarding gcc/GNAT vs. native compilers:

The machine-portable code generator used by gcc/GNAT will only very
rarely give you the top performance that the native compiler gives you
- I mean, if you are comparing number-crunching inner loops against a
native Fortran compiler. For example, if GNAT is giving you 80% of
HP Fortran speed on the HP, then you are doing very well. The native
compiler's code generator knows tricks about the machine that gcc will
never know, and it has been carefully tuned for many years to do well
on the kinds of simple loops that show up in standard benchmarks like
Linpack. On the other hand, if you can't get gcc/GNAT to produce
executables that run at the same speed as gcc/C and gcc/Fortran (g77),
then you are doing something wrong. (Actually, I ran some comparisons
of g77 and GNAT: GNAT was consistently slightly faster. From this you
cannot conclude that Ada is faster than Fortran! Anyway, it was an
early g77.)
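Before going further, here is the kind of thing I mean by a BLAS-like
building block with checks suppressed. This is only a sketch, and all
of the names in it are mine, invented for illustration - it is a
generic over a constrained formal array type, per the advice above:

   generic
      type Real   is digits <>;
      type Index  is range <>;
      type Vector is array (Index) of Real;  -- constrained formal array
   function Dot (X, Y : Vector) return Real;

   function Dot (X, Y : Vector) return Real is
      pragma Suppress (All_Checks);  -- or compile with -gnatp
      Sum : Real := 0.0;
   begin
      for I in Vector'Range loop
         Sum := Sum + X (I) * Y (I);
      end loop;
      return Sum;
   end Dot;

Instantiate it with a 0-based array type (more on that below) and let
-funroll-all-loops do the rest, or unroll by hand as described later.
Anyway, back to the compiler comparison.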
So on an engineering workstation with an optimized native Fortran, you
can expect the Fortran to consistently produce better average and
better peak performance than portable compilers. Nevertheless, there is
a down side to this: you are fooling yourself if you think any Fortran
compiler will consistently get a high percentage of the machine's peak
out of most loops you feed it. First, beyond the few simple loops that
appear in common benchmarks, you shouldn't expect too much from the
optimizer. Second, it's very often the case, in my experience, that you
have to spend a lot of time rewriting loops to find the one the
optimizer smiles on. In fact - honest truth - I just spent a week
rewriting Fortran inner loops until I found the ones the optimizer
liked; it doubled the speed of some of them. (The program runs for days
on a massively parallel supercomputer.) Good Fortran programmers will
often automatically write their programs to use the BLAS, so they don't
have to reinvent these particular wheels the hard way.

So that's why many of us are happy using the gcc family: you can expect
to do more work optimizing, but you may have to do that work in any
language, with any compiler. And of course we get to use our favorite
language.

Regarding performance of gcc/GNAT:

Here are a few notes on optimizing with GNAT 2.02.

1. Avoid unconstrained arrays in inner loops - they often make life
difficult for the optimizer, especially if they are generic formals. I
greatly prefer the services and abstraction provided by generics to
those provided by unconstrained arrays. You don't need them.

2. Try to write loops so that the outer (rightmost) indices of the
array vary fastest. Remember, Fortran prefers the inner (leftmost)
indices varied fastest. GNAT will give you a choice of which indices
you should vary fastest (pragma Convention (Fortran, ...) on the array
type requests column-major layout), but you have to know what pragma to
use, and most compilers don't yet offer this Ada 95 feature.

In your Householder routine, the arrays are symmetric: a(i,j) = a(j,i).
It's easy to rewrite the inner loops so that the outer indices vary
fastest. I always make the arrays one-dimensional arrays of
one-dimensional arrays. Inner loops then become dot products between
rows of the matrix, which I pass to a dot-product routine in which the
loop has been unrolled by hand. GNAT gets pretty good performance this
way. But if you write this dot product in a naive way, GNAT's
performance can drop by a factor of 3 to 6 on the HP.

Actually, people who want the best performance unroll both the inner
and outer loops in your problem. In fact, there are automatic tools
that try every conceivable combination of unroll parameters to find the
best. I'll never bother with that kind of thing! That's what the BLAS
was invented for.

3. If you are writing a BLAS-like package for GNAT, here are the kinds
of things you should consider. This is from a package I wrote:

-- If you are going to do the dot products in a high-level language,
-- then the method that gives the best efficiency will usually be
-- different on each machine and compiler. So the loop unrolling here
-- should be taken as a hint at best, but it usually works better than
-- no unrolling at all.
--
-- The procedure exports the best of several attempts to optimize
-- dot products for gcc/GNAT (2.02) on an HP 735. Three methods are
-- used. The 3 methods are placed in a case statement. This is
-- convenient, and it's also a lot less error-prone than the
-- alternatives. (But the code is considerably uglier, and the extra
-- overhead noticeable.)
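-- To sketch the shape of that case statement (Unroll_Method, Method,
-- and the three procedure names below are mine, invented for
-- illustration; the actual package's names differ):

   type Unroll_Method is (Ordinary_Loops, Standard_Unrolled, Static_Unrolled);

   case Method is
      when Ordinary_Loops    => Dot_Plain    (X, Y, Sum);
      when Standard_Unrolled => Dot_Unrolled (X, Y, Sum);
      when Static_Unrolled   => Dot_Static   (X, Y, Sum);
   end case;

-- where each arm runs one of the loop bodies quoted below.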
-- It's important to experiment with different compilation flags. On
-- gcc/GNAT, the following seems to work best in most cases:
--
--    gcc -c -gnatp -O3 -funroll-all-loops -fomit-frame-pointer File_Name.adb
--
-- Even though some of the loops are unrolled by hand, subsequent
-- application of -funroll-all-loops is very important. On the other
-- hand, a better compiler than gcc/GNAT, one that knows more about the
-- target architecture, will also know best how to unroll the loops for
-- you. In that case you should use Ordinary_Loops:
--
-- Ordinary_Loops does no unrolling by hand. This should be compiled
-- with gcc's -funroll-all-loops, because the loop bounds are known (in
-- general) only at run time.

   Sum := 0.0;
   for L in Starting_Index .. Ending_Index loop
      Sum := Sum + X(L)*Y(L);
   end loop;

-- In the 2nd method, the loops are also unrolled by hand (6 levels
-- deep), but the gcc/GNAT flag -funroll-all-loops is still required
-- for best performance. This one is called Standard_Unrolled. Here is
-- part of the inner loop:

   for Segments in 2 .. Unroll_Segments loop
    --L0 := L5 + 1; L1 := L0 + 1; L2 := L1 + 1;
    --L3 := L2 + 1; L4 := L3 + 1; L5 := L4 + 1;
      L0 := L0 + Unroll_Length_Stnd; L1 := L1 + Unroll_Length_Stnd;
      L2 := L2 + Unroll_Length_Stnd; L3 := L3 + Unroll_Length_Stnd;
      L4 := L4 + Unroll_Length_Stnd; L5 := L5 + Unroll_Length_Stnd;
      Sum := Sum + X(L0)*Y(L0) + X(L1)*Y(L1) + X(L2)*Y(L2)
                 + X(L3)*Y(L3) + X(L4)*Y(L4) + X(L5)*Y(L5);
   end loop;

-- Best choice of increments varies. The chosen increments of L0, L1,
-- ... enable possible parallel execution.

-- The 3rd method wraps a slightly unrolled dot product in a loop with
-- static bounds. This can be compiled with -funroll-loops rather than
-- -funroll-all-loops (which is required for non-static bounds). This
-- got the best performance from gcc/GNAT on the hppa. This one is
-- called Static_Unrolled.

   for Segments in 2 .. Unroll_Segments loop
      for N in Integer range 1 .. Half_Unroll_Length_Static loop
         L0 := L1 + 1; L1 := L0 + 1;
       --L0 := L0 + 2; L1 := L1 + 2;
         Sum := Sum + X(L0)*Y(L0) + X(L1)*Y(L1);
      end loop;
   end loop;

-- Best choice of increments varies.

Two final notes: a) there are lots of other tricks worth trying; b) to
get gcc to unroll, instantiate the generics with 0-based arrays.

4. To give you an idea of what to expect from GNAT, here's a benchmark
of an FFT I wrote. It doesn't use any of the tricks I describe above
(there's no linear algebra in it). I didn't do much work to optimize,
so I'm fairly impressed with GNAT here. The comparison is with an
identical version of the FFT in Fortran, compiled with the latest
version of HP's native Fortran compiler. All results are megaflops of
performance on an HP-735/125.

1. Performance in megaflops of procedure FFT, on data in an array of
length 8192. The FFT is performed on subsets of the data, of length N.
[Note: f77 inlined; GNAT 2.02 didn't.]

   N =                256   512  1024  2048  4096  8192
   ----------------------------------------------------
   gnat/Ada  gcc -O3   54    55    61    60    64    63   megaflops
   Fortran   f77 +O4   66    69    78    77    70    55   megaflops
   Ratio     gcc/f77  82%   80%   78%   78%   91%  114%

The +O4 optimization gave the fastest HP f77 results.

Megaflops are calculated by using 4*N*Log_Base_2(N) as the number of
floating point operations per call of the radix-4 FFT. Bit reversals of
the data are performed each call, so the timings include overhead from
the bit reversals and the procedure calls. They are insensitive to
overhead from initialization of the arrays of sines and cosines.
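For concreteness, this is roughly how such a figure gets computed. The
skeleton below is mine, with the FFT call left as a stub (Rate_FFT, N,
and Calls are invented names, and the operation count is the
4*N*Log_Base_2(N) quoted above):

   with Ada.Calendar;                      use Ada.Calendar;
   with Ada.Text_IO;                       use Ada.Text_IO;
   with Ada.Numerics.Elementary_Functions; use Ada.Numerics.Elementary_Functions;

   procedure Rate_FFT is
      N     : constant := 1024;   -- length of each transform
      Calls : constant := 1_000;  -- repetitions under the timer
      Start, Stop : Time;
      Secs, Ops   : Float;
   begin
      Start := Clock;
      for C in 1 .. Calls loop
         null;  -- the FFT call (including its bit reversal) goes here
      end loop;
      Stop := Clock;
      Secs := Float (Stop - Start);
      Ops  := 4.0 * Float (N) * Log (Float (N), 2.0) * Float (Calls);
      if Secs > 0.0 then   -- with the stub in place the time is ~0
         Put_Line ("Megaflops:" & Float'Image (Ops / (Secs * 1.0E6)));
      end if;
   end Rate_FFT;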
The best GNAT/gcc results came from:

   gcc -c -gnatp -O3 -funroll-all-loops File_Name.adb

On machines with fewer registers, the -fomit-frame-pointer gcc flag
might be useful. The -funroll-all-loops only helped a little (1%).

SUMMARY: The GNAT/gcc compilation ran at about 80% to 110% of the speed
of the HP f77 compilations. 80% was typical in the common benchmark
limits. As the arrays grew larger, the performance of the two grew
comparable. (Cache-size limitations begin to matter in the limit of
large arrays.)

2. On an array of 16384 points, with N = 16384 (the number of points
the FFT is performed on), we get the following results in megaflops:

   gcc -O3    43 megaflops
   f77 +O4    52 megaflops

3. On an array of 2**15 + 219 points, with N = 2**15 (the number of
points the FFT is performed on), we get the following results in
megaflops:

   gcc -O3    16 megaflops
   f77 +O4    16 megaflops

The + 219 is there because it reduces cache thrashing, improving
performance. Other FFT algorithms are needed to really get around these
cache problems. Here the speed is limited by the fact that the FFT
reads quasi-randomly from an array that is too large for the cache.

4. The FFTs are standard radix-4 Cooley-Tukey FFTs. All arithmetic is
real. The Fortran and Ada programs are virtually identical. When the
same program is rewritten to use complex arrays (but still real
arithmetic in the inner loops), the GNAT/gcc compilation runs 15%
slower.

-- Jonathan Parker (tcp0063@harrier.am.qub.ac.uk)
-- Department of Applied Mathematics and Theoretical Physics
-- The Queen's University of Belfast
--