From: jp3@rl.ac.uk (Dr J Parker)
Subject: Ada-95 for numerics?
Date: 1996/04/05
Message-ID: <4k42bc$91m@newton.cc.rl.ac.uk>
Organization: Rutherford Appleton Laboratory, Oxon, UK
Newsgroups: comp.lang.ada

Joerg Rodemann writes:

"Since I will have to do a lot of matrix diagonalization in the near
future (matrix dimensions will be about 1000x1000) I checked the speed
different languages produce. I used the routine tred2 from "Numerical
Recipes in C", which brings a real symmetric matrix to tri-diagonal
form. I tested the following languages/compilers:

  1.) C:          compilers: HP-UX cc, gcc-2.7.2; Solaris gcc-2.7.2
  2.) Fortran-77: compiler:  HP-UX f77
                  (the routine was called from an Ada-95 main)
  3.) Ada-95:     compiler:  GNAT 3.01/gcc-2.7.2

a.) Has anybody out there used/tested Ada for similar problems? What
were the results?"

Sure, I spent about 2 weeks working with GNAT 2.02 on an HP 735, trying
to optimize some scalar and some linear algebra programs. I'll tell you
what I learned, but remember it's limited to a specific machine, an old
GNAT, and 2 weeks of tinkering.

Summary of everything I'm going to say: know the difference between a
machine-portable compiler like gcc/GNAT and a native Fortran compiler;
don't use unconstrained arrays; in multidimensional arrays, vary the
outer indices fastest; unroll critical linear algebra loops by hand;
after that, use the -funroll-all-loops flag; write critical inner loops
in lots of different ways and try various combinations of compiler
flags; if you are doing linear algebra, then no matter what language or
compiler you are using, consider linking to the BLAS (basic linear
algebra subroutines); if you don't want to use the BLAS, write your own
BLAS-like package - here is where you definitely suppress checks; the
scalar benchmark presented below (an FFT) has gcc/GNAT running at 80%
to 110% of HP Fortran speed; gcc/GNAT peaked at 64 megaflops, and HP's
native Fortran at 78 megaflops, on a machine with a (100 x 100) Linpack
rating of about 50 megaflops. (You can stop reading here :)

Regarding gcc/GNAT vs. native compilers:

The machine-portable code generator used by gcc/GNAT will only very
rarely give you the top performance that the native compiler gives you
- I mean, if you are comparing number-crunching inner loops against a
native Fortran compiler. For example, if GNAT is giving you 80% of
HP Fortran speed on the HP, then you are doing very well. The native
compiler's code generator knows tricks about the machine that gcc will
never know, and it has been carefully tuned for many years to do well
on the kinds of simple loops that show up in standard benchmarks like
Linpack. On the other hand, if you can't get gcc/GNAT to produce
executables that run at the same speed as gcc/C and gcc/Fortran (g77),
then you are doing something wrong. (Actually, I ran some comparisons
of g77 and GNAT: GNAT was consistently slightly faster. From this you
cannot conclude that Ada is faster than Fortran! Anyway, it was an
early g77.)
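Before going further, here is the kind of thing I mean by a BLAS-like
building block with checks suppressed. This is only a sketch, and all
of the names in it are mine, invented for illustration - it is a
generic over a constrained formal array type, per the advice above:

   generic
      type Real   is digits <>;
      type Index  is range <>;
      type Vector is array (Index) of Real;  -- constrained formal array
   function Dot (X, Y : Vector) return Real;

   function Dot (X, Y : Vector) return Real is
      pragma Suppress (All_Checks);  -- or compile with -gnatp
      Sum : Real := 0.0;
   begin
      for I in Vector'Range loop
         Sum := Sum + X (I) * Y (I);
      end loop;
      return Sum;
   end Dot;

Instantiate it with a 0-based array type (more on that below) and let
-funroll-all-loops do the rest, or unroll by hand as described later.
Anyway, back to the compiler comparison.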
So on an engineering workstation with an optimized native Fortran, you
can expect the Fortran to consistently produce better average and
better peak performance than portable compilers. Nevertheless, there is
a down side to this: you are fooling yourself if you think any Fortran
compiler will consistently get a high percentage of the machine's peak
out of most loops you feed it. First, beyond the few simple loops that
appear in common benchmarks, you shouldn't expect too much from the
optimizer. Second, it's very often the case, in my experience, that you
have to spend a lot of time rewriting loops to find the one the
optimizer smiles on. In fact - honest truth - I just spent a week
rewriting Fortran inner loops until I found the ones the optimizer
liked; it doubled the speed of some of them. (The program runs for days
on a massively parallel supercomputer.) Good Fortran programmers will
often automatically write their programs to use the BLAS, so they don't
have to reinvent these particular wheels the hard way.

So that's why many of us are happy using the gcc family: you can expect
to do more work optimizing, but you may have to do that work in any
language, with any compiler. And of course we get to use our favorite
language.

Regarding performance of gcc/GNAT:

Here are a few notes on optimizing with GNAT 2.02.

1. Avoid unconstrained arrays in inner loops - they often make life
difficult for the optimizer, especially if they are generic formals. I
greatly prefer the services and abstraction provided by generics to
those provided by unconstrained arrays. You don't need them.

2. Try to write loops so that the outer (rightmost) indices of the
array vary fastest. Remember, Fortran prefers the inner (leftmost)
indices varied fastest. GNAT will give you a choice of which indices
you should vary fastest (pragma Convention (Fortran, ...) on the array
type requests column-major layout), but you have to know what pragma to
use, and most compilers don't yet offer this Ada 95 feature.

In your Householder routine, the arrays are symmetric: a(i,j) = a(j,i).
It's easy to rewrite the inner loops so that the outer indices vary
fastest. I always make the arrays one-dimensional arrays of
one-dimensional arrays. Inner loops then become dot products between
rows of the matrix, which I pass to a dot-product routine in which the
loop has been unrolled by hand. GNAT gets pretty good performance this
way. But if you write this dot product in a naive way, GNAT's
performance can drop by a factor of 3 to 6 on the HP.

Actually, people who want the best performance unroll both the inner
and outer loops in your problem. In fact, there are automatic tools
that try every conceivable combination of unroll parameters to find the
best. I'll never bother with that kind of thing! That's what the BLAS
was invented for.

3. If you are writing a BLAS-like package for GNAT, here are the kinds
of things you should consider. This is from a package I wrote:

-- If you are going to do the dot products in a high-level language,
-- then the method that gives the best efficiency will usually be
-- different on each machine and compiler. So the loop unrolling here
-- should be taken as a hint at best, but it usually works better than
-- no unrolling at all.
--
-- The procedure exports the best of several attempts to optimize
-- dot products for gcc/GNAT (2.02) on an HP 735. Three methods are
-- used. The 3 methods are placed in a case statement. This is
-- convenient, and it's also a lot less error-prone than the
-- alternatives. (But the code is considerably uglier, and the extra
-- overhead noticeable.)
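-- To sketch the shape of that case statement (Unroll_Method, Method,
-- and the three procedure names below are mine, invented for
-- illustration; the actual package's names differ):

   type Unroll_Method is (Ordinary_Loops, Standard_Unrolled, Static_Unrolled);

   case Method is
      when Ordinary_Loops    => Dot_Plain    (X, Y, Sum);
      when Standard_Unrolled => Dot_Unrolled (X, Y, Sum);
      when Static_Unrolled   => Dot_Static   (X, Y, Sum);
   end case;

-- where each arm runs one of the loop bodies quoted below.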
-- It's important to experiment with different compilation flags. On
-- gcc/GNAT, the following seems to work best in most cases:
--
--    gcc -c -gnatp -O3 -funroll-all-loops -fomit-frame-pointer File_Name.adb
--
-- Even though some of the loops are unrolled by hand, subsequent
-- application of -funroll-all-loops is very important. On the other
-- hand, a better compiler than gcc/GNAT, one that knows more about the
-- target architecture, will also know best how to unroll the loops for
-- you. In that case you should use Ordinary_Loops:
--
-- Ordinary_Loops does no unrolling by hand. This should be compiled
-- with gcc's -funroll-all-loops, because the loop bounds are known (in
-- general) only at run time.

   Sum := 0.0;
   for L in Starting_Index .. Ending_Index loop
      Sum := Sum + X(L)*Y(L);
   end loop;

-- In the 2nd method, the loops are also unrolled by hand (6 levels
-- deep), but the gcc/GNAT flag -funroll-all-loops is still required
-- for best performance. This one is called Standard_Unrolled. Here is
-- part of the inner loop:

   for Segments in 2 .. Unroll_Segments loop
    --L0 := L5 + 1; L1 := L0 + 1; L2 := L1 + 1;
    --L3 := L2 + 1; L4 := L3 + 1; L5 := L4 + 1;
      L0 := L0 + Unroll_Length_Stnd; L1 := L1 + Unroll_Length_Stnd;
      L2 := L2 + Unroll_Length_Stnd; L3 := L3 + Unroll_Length_Stnd;
      L4 := L4 + Unroll_Length_Stnd; L5 := L5 + Unroll_Length_Stnd;
      Sum := Sum + X(L0)*Y(L0) + X(L1)*Y(L1) + X(L2)*Y(L2)
                 + X(L3)*Y(L3) + X(L4)*Y(L4) + X(L5)*Y(L5);
   end loop;

-- Best choice of increments varies. The chosen increments of L0, L1,
-- ... enable possible parallel execution.

-- The 3rd method wraps a slightly unrolled dot product in a loop with
-- static bounds. This can be compiled with -funroll-loops rather than
-- -funroll-all-loops (which is required for non-static bounds). This
-- got the best performance from gcc/GNAT on the hppa. This one is
-- called Static_Unrolled.

   for Segments in 2 .. Unroll_Segments loop
      for N in Integer range 1 .. Half_Unroll_Length_Static loop
         L0 := L1 + 1; L1 := L0 + 1;
       --L0 := L0 + 2; L1 := L1 + 2;
         Sum := Sum + X(L0)*Y(L0) + X(L1)*Y(L1);
      end loop;
   end loop;

-- Best choice of increments varies.

Two final notes: a) there are lots of other tricks worth trying; b) to
get gcc to unroll, instantiate the generics with 0-based arrays.

4. To give you an idea of what to expect from GNAT, here's a benchmark
of an FFT I wrote. It doesn't use any of the tricks I describe above
(there's no linear algebra in it). I didn't do much work to optimize,
so I'm fairly impressed with GNAT here. The comparison is with an
identical version of the FFT in Fortran, compiled with the latest
version of HP's native Fortran compiler. All results are megaflops of
performance on an HP-735/125.

1. Performance in megaflops of procedure FFT, on data in an array of
length 8192. The FFT is performed on subsets of the data, of length N.
[Note: f77 inlined; GNAT 2.02 didn't.]

   N =                256   512  1024  2048  4096  8192
   ----------------------------------------------------
   gnat/Ada  gcc -O3   54    55    61    60    64    63   megaflops
   Fortran   f77 +O4   66    69    78    77    70    55   megaflops
   Ratio     gcc/f77  82%   80%   78%   78%   91%  114%

The +O4 optimization gave the fastest HP f77 results.

Megaflops are calculated by using 4*N*Log_Base_2(N) as the number of
floating point operations per call of the radix-4 FFT. Bit reversals of
the data are performed each call, so the timings include overhead from
the bit reversals and the procedure calls. They are insensitive to
overhead from initialization of the arrays of sines and cosines.
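For concreteness, this is roughly how such a figure gets computed. The
skeleton below is mine, with the FFT call left as a stub (Rate_FFT, N,
and Calls are invented names, and the operation count is the
4*N*Log_Base_2(N) quoted above):

   with Ada.Calendar;                      use Ada.Calendar;
   with Ada.Text_IO;                       use Ada.Text_IO;
   with Ada.Numerics.Elementary_Functions; use Ada.Numerics.Elementary_Functions;

   procedure Rate_FFT is
      N     : constant := 1024;   -- length of each transform
      Calls : constant := 1_000;  -- repetitions under the timer
      Start, Stop : Time;
      Secs, Ops   : Float;
   begin
      Start := Clock;
      for C in 1 .. Calls loop
         null;  -- the FFT call (including its bit reversal) goes here
      end loop;
      Stop := Clock;
      Secs := Float (Stop - Start);
      Ops  := 4.0 * Float (N) * Log (Float (N), 2.0) * Float (Calls);
      if Secs > 0.0 then   -- with the stub in place the time is ~0
         Put_Line ("Megaflops:" & Float'Image (Ops / (Secs * 1.0E6)));
      end if;
   end Rate_FFT;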
The best GNAT/gcc results came from:

   gcc -c -gnatp -O3 -funroll-all-loops File_Name.adb

On machines with fewer registers, the -fomit-frame-pointer gcc flag
might be useful. The -funroll-all-loops only helped a little (1%).

SUMMARY: The GNAT/gcc compilation ran at about 80% to 110% of the speed
of the HP f77 compilations. 80% was typical in the common benchmark
limits. As the arrays grew larger, the performance of the two grew
comparable. (Cache-size limitations begin to matter in the limit of
large arrays.)

2. On an array of 16384 points, with N = 16384 (the number of points
the FFT is performed on), we get the following results in megaflops:

   gcc -O3    43 megaflops
   f77 +O4    52 megaflops

3. On an array of 2**15 + 219 points, with N = 2**15 (the number of
points the FFT is performed on), we get the following results in
megaflops:

   gcc -O3    16 megaflops
   f77 +O4    16 megaflops

The + 219 is there because it reduces cache thrashing, improving
performance. Other FFT algorithms are needed to really get around these
cache problems. Here the speed is limited by the fact that the FFT
reads quasi-randomly from an array that is too large for the cache.

4. The FFTs are standard radix-4 Cooley-Tukey FFTs. All arithmetic is
real. The Fortran and Ada programs are virtually identical. When the
same program is rewritten to use complex arrays (but still real
arithmetic in the inner loops), the GNAT/gcc compilation runs 15%
slower.

-- Jonathan Parker (tcp0063@harrier.am.qub.ac.uk)
-- Department of Applied Mathematics and Theoretical Physics
-- The Queen's University of Belfast
--