* GNAT compiler switches and optimization
@ 2006-10-20 10:47 tkrauss  (13 replies; 68+ messages in thread)
From: tkrauss @ 2006-10-20 10:47 UTC (permalink / raw)

I'm a bit stuck trying to figure out how to coax more performance out
of some Ada code.  I suspect there is something simple (like compiler
switches) that I'm missing.  As an example I'm using a simple matrix
multiply and comparing it to similar code in Fortran.  Unfortunately
the Ada code takes 3-4 times as long.

I'm using GNAT (GPL 2006) and gfortran (4.2.0) and the following
compile options:

  gnat make -O3 -gnatp tst_array
  gfortran -O3 tst_array.f95

Running them on 800x800 matrices (on my 2 GHz laptop):

  Ada:     "tst_array 800" runs in 18 seconds
  Fortran: "tst_array 800" runs in 6 seconds

(If I use the Fortran "matmul" intrinsic, the Fortran time drops to
2.5 seconds.)

Note, I tried reordering the loops, removing the random calls, etc.,
none of which made huge changes.  Something is killing performance,
and/or there is a switch or two that I'm missing, but I can't seem to
find it.  Any thoughts?
Here is the code:

-- tst_array.adb
with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array is
   package F_IO is new Ada.Text_IO.Float_IO(Float);
   package D_IO is new Ada.Text_IO.Fixed_IO(Duration);
   type Real_Matrix is array(Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;

   A,B,C : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   N := Positive'Value (Argument (1));
   Start := Ada.Calendar.Clock;

   A := new Real_Matrix(1..N, 1..N);
   B := new Real_Matrix(1..N, 1..N);
   C := new Real_Matrix(1..N, 1..N);
   for I in A'range(1) loop
      for J in A'range(2) loop
         A(I,J) := Ada.Numerics.Float_Random.Random(G);
         B(I,J) := Ada.Numerics.Float_Random.Random(G);
      end loop;
   end loop;

   for I in A'range(1) loop
      for J in A'range(2) loop
         Sum := 0.0;
         for R in A'range(2) loop
            Sum := Sum + A(I,R)*B(R,J);
         end loop;
         C(I,J) := Sum;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C(1,1)); F_IO.Put (C(1,2)); New_Line;
   F_IO.Put (C(2,1)); F_IO.Put (C(2,2)); New_Line;

   Finish := Ada.Calendar.Clock;
   Put ("Time: "); D_IO.Put(Finish-Start); New_Line; New_Line;
end tst_array;


-- tst_array.f95
program test_array
  integer,parameter :: seed = 86456
  integer :: N
  integer :: numArgs
  real, dimension(:,:), allocatable :: a, b, c
  real, dimension(:,:), allocatable :: d, e, f
  real :: sum
  character*100 buffer
  real :: begin, finish
  integer :: I, J, R

  call getarg(1,buffer)
  read(buffer,*) N

  call srand(seed)

  begin = second()

  allocate( a(N,N) )
  allocate( b(N,N) )
  allocate( c(N,N) )
  forall (I = 1:N, J = 1:N)
    a(i,j) = rand()
    b(i,j) = rand()
  end forall

  !c = matmul(a, b)
  do I = 1,N
    do J = 1,N
      sum = 0.0
      do R = 1,N
        sum = sum + A(I,R)*B(R,J)
      end do
      C(I,J) = sum
    end do
  end do

  print *, c(1,1), c(1,2), c(2,1), c(2,2)
  finish = second()
  print *, 'Time: ', (finish-begin)

end program test_array

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization
From: Duncan Sands @ 2006-10-20 11:04 UTC (permalink / raw)
To: comp.lang.ada; +Cc: tkrauss

On Friday 20 October 2006 12:47, tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

GNAT GPL 2006 is based on gcc 3.4.6.  For Fortran you are using
gcc 4.2.0.  Try using comparable compiler versions, e.g. an Ada-aware
gcc 4.2.0 (several Linux distributions provide this) or a gcc 3.4.6
version of Fortran (i.e. some version of g77).

Ciao,

Duncan.
* Re: GNAT compiler switches and optimization
From: Stephen Leake @ 2006-10-21 10:45 UTC (permalink / raw)

Duncan Sands <duncan.sands@math.u-psud.fr> writes:
> GNAT GPL 2006 is based on gcc 3.4.6.  For fortran you are using
> gcc 4.2.0.  Try using comparable compiler versions, eg: an Ada
> aware gcc 4.2.0 (several linux distributions provide this) or
> a gcc 3.4.6 version of fortran (i.e. some version of g77).

I have a pre-release of GNAT 5.05, which is based on gcc 4.2.  I tried
tst_array with both -O2 and -O3; -O2 was faster.  AdaCore says -O3 has
"dangerous experimental optimizations; don't use it".

Here's a comparison:

GNAT 5.04, -O2 -gnatp:

  ./tst_array.exe 800
  Init Time:  0.557314686
  Mult Time: 20.139092538
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02

GNAT 5.05, -O2 -gnatp:

  ./tst_array.exe 800
  Init Time:  0.575690486
  Mult Time: 12.160329316
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02

Note that I timed the initialization with random numbers separately
from the matrix multiply; it could easily be that the random number
generators are different.  But that's a small effect.

I don't have gfortran installed.

--
-- Stephe
* Re: GNAT compiler switches and optimization
From: Duncan Sands @ 2006-10-20 11:42 UTC (permalink / raw)
To: comp.lang.ada; +Cc: tkrauss

On Friday 20 October 2006 12:47, tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

GNAT GPL 2006 is based on gcc 3.4.6.  For Fortran you are using
gcc 4.2.0.  Try using comparable compiler versions, e.g. an Ada-aware
gcc 4.2.0 (several Linux distributions provide this) or a gcc 3.4.6
version of Fortran (i.e. some version of g77).

Ciao,

Duncan.
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 15:41 UTC (permalink / raw)

Duncan Sands wrote:
> GNAT GPL 2006 is based on gcc 3.4.6.  For fortran you are using
> gcc 4.2.0.  Try using comparable compiler versions, eg: an Ada
> aware gcc 4.2.0 (several linux distributions provide this) or
> a gcc 3.4.6 version of fortran (i.e. some version of g77).

Indeed.

GNAT/GPL:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   12.265258000

GNAT/GCC:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   10.631787000

So who said the 4.1.x compilers are slower ...

Martin
--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Samuel Tardieu @ 2006-10-20 12:09 UTC (permalink / raw)

>>>>> "tkrauss" == tkrauss <thomas.krauss@gmail.com> writes:

tkrauss> Running them on 800x800 matrices (on my 2GHz laptop)
tkrauss> for Ada: "tst_array 800" runs in 18 seconds
tkrauss> for Fortran "tst_array 800" runs in 6 seconds

tkrauss> Note, I tried reordering the loops, removing the random
tkrauss> calls, etc. none of which made huge changes.  There is
tkrauss> something killing performance and/or a switch or two that
tkrauss> I'm missing, but I can't seem to find it.  Any thoughts?

First of all, what you measure is not only the matrix multiplication
time but also the operation of filling the matrices with random
numbers.  I've moved the "Start" initialization after the matrix
initialization.

The following optimizations make the difference smaller (9.47 seconds
for Fortran vs. 11.90 seconds for Ada on my machine):

 - use -fomit-frame-pointer on the gnatmake command line (this doesn't
   change anything in the Fortran case)

 - add "pragma Convention (Fortran, Real_Matrix);" to invert the
   storage method (row- vs. column-major); I guess this helps keep
   more data in the cache

 - use 1 .. N as loop indices instead of A'Range (1) and friends; this
   is closer to the Fortran code you posted

Still, this is a huge penalty for Ada.  Unfortunately, I don't have
the time to investigate further right now.  However, I would be
interested in other people's findings.

  Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
* Re: GNAT compiler switches and optimization
From: Samuel Tardieu @ 2006-10-20 12:18 UTC (permalink / raw)

>>>>> "Sam" == Samuel Tardieu <sam@rfc1149.net> writes:

Sam> Still, this is a huge penalty for Ada.  Unfortunately, I don't
Sam> have the time to investigate further right now.  However, I would
Sam> be interested in other people's findings.

Oh, also this one lowers the Ada execution time by around 10%: do not
use an unconstrained type and do not use pointers.

  N : constant Positive := Positive'Value (Argument (1));
  G : Ada.Numerics.Float_Random.Generator;

  type Real_Matrix is array (1 .. N, 1 .. N) of Float;
  pragma Convention (Fortran, Real_Matrix);

  A, B, C : Real_Matrix;

And as Duncan says, you should try a newer backend for Ada.

  Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 12:12 UTC (permalink / raw)

tkrauss:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

One thing is that you start the timer in the wrong place.  You should
start it after filling your arrays with random numbers.  In the
present state you compare the random generators, then your matrix
multiplication.  It is possible that Ada.Numerics.Float_Random.Random
takes significantly more time due to Ada's quality requirements on
random generation.

On the other hand, switches you can try are:

 - -O2 instead of -O3
 - -funroll-loops (usually good)
 - -ffast-math, for both Ada and Fortran
 - -gnatn, for Ada

Since you are measuring real time and not CPU time, you might also
want to use somewhat larger matrices so that disk swaps and leftover
initialization effects are statistically small.

HTH, Gautier
______________________________________________________________
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm
NB: For a direct answer, e-mail address on the Web site!
* Re: GNAT compiler switches and optimization
From: Dmitry A. Kazakov @ 2006-10-20 12:35 UTC (permalink / raw)

On 20 Oct 2006 03:47:44 -0700, tkrauss wrote:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.
[...]

Pointers + index checks, I'd say.
Try this:

with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array_2 is
   package F_IO is new Ada.Text_IO.Float_IO(Float);
   package D_IO is new Ada.Text_IO.Fixed_IO(Duration);
   type Real_Matrix is array(Integer range <>, Integer range <>) of Float;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   N := Positive'Value (Argument (1));
   declare
      subtype Matrix is Real_Matrix (1..N, 1..N);
      A, B, C : Matrix;
   begin
      for I in A'range(1) loop
         for J in A'range(2) loop
            A(I,J) := Ada.Numerics.Float_Random.Random(G);
            B(I,J) := Ada.Numerics.Float_Random.Random(G);
         end loop;
      end loop;

      Start := Ada.Calendar.Clock;
      for I in A'range(1) loop
         for J in A'range(2) loop
            Sum := 0.0;
            for R in A'range(2) loop
               Sum := Sum + A(I,R)*B(R,J);
            end loop;
            C(I,J) := Sum;
         end loop;
      end loop;
      Finish := Ada.Calendar.Clock;

      F_IO.Put (C(1,1)); F_IO.Put (C(1,2)); New_Line;
      F_IO.Put (C(2,1)); F_IO.Put (C(2,2)); New_Line;
   end;
   Put ("Time: "); D_IO.Put(Finish-Start); New_Line; New_Line;
end tst_array_2;

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 15:53 UTC (permalink / raw)

Dmitry A. Kazakov wrote:
>    declare
>       subtype Matrix is Real_Matrix (1..N, 1..N);
>       A, B, C : Matrix;
>    begin

Indeed faster:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   10.459045000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    6.313824000

Martin
--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 12:52 UTC (permalink / raw)

tkrauss wrote:

>   N := Positive'Value (Argument (1));
...
>   for I in A'range(1) loop
>     for J in A'range(2) loop

You probably pay a penalty there compared to

>   do I = 1,N
>     do J = 1,N

I guess the GNAT compiler can't see through the bounds because of

>   A := new Real_Matrix(1..N, 1..N);

especially since the allocation is dynamic.  You indeed have

  for I in 1..N loop

so you should try writing that explicitly, and even

  for I in reverse 1..N loop

Perhaps g95 does that optimization where Ada has to stick with the
specified direction.  Also, compiling with -S and looking at the
assembler code helps a lot ...

HTH, Gautier
______________________________________________________________
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm
NB: For a direct answer, e-mail address on the Web site!
* Re: GNAT compiler switches and optimization
From: claude.simon @ 2006-10-20 13:27 UTC (permalink / raw)

With some changes you can speed up your code.

First, the matrix product is not optimal: you can transpose B.
Second, matrix A is used row by row, so you can copy A (I, 1 .. N)
into a vector, say AI, and replace A(I,R) by AI(R) in the product
(it should be up to the compiler to see that :-( ).

By applying these changes I got an Ada version faster than the
Fortran one (at the same -O3 optimization level).  You can make the
same changes to your Fortran code and then search for a new trick to
speed up your Ada code :-).

Claude SIMON

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.
[...]
> [original programs quoted in full; snipped]
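Claude's two transformations can be sketched as a standalone Ada
program.  This is a hypothetical sketch, not code posted in the
thread; the names BT and AI, the fixed size N, and the filler values
are illustrative assumptions.  After the transpose and the row copy,
both operands of the inner dot product are traversed with stride 1.

```ada
--  Hypothetical sketch of Claude's suggestion: transpose B once, and
--  copy each row of A into a local vector AI before the inner loops,
--  so both operands of the dot product are walked with stride 1.
procedure Tst_Transposed is
   N : constant := 4;   --  small fixed size, for the sketch only
   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   type Real_Vector is array (1 .. N) of Float;
   A   : constant Real_Matrix := (others => (others => 1.0));
   B   : constant Real_Matrix := (others => (others => 2.0));
   BT  : Real_Matrix;
   AI  : Real_Vector;
   C   : Real_Matrix;
   Sum : Float;
begin
   for I in 1 .. N loop          --  BT := transpose of B
      for J in 1 .. N loop
         BT (I, J) := B (J, I);
      end loop;
   end loop;
   for I in 1 .. N loop
      for R in 1 .. N loop
         AI (R) := A (I, R);     --  row of A, copied once per I
      end loop;
      for J in 1 .. N loop
         Sum := 0.0;
         for R in 1 .. N loop
            Sum := Sum + AI (R) * BT (J, R);   --  both stride-1
         end loop;
         C (I, J) := Sum;
      end loop;
   end loop;
end Tst_Transposed;
```

The transpose costs O(N**2) and the multiply O(N**3), so for large N
the extra pass is cheap compared to the cache misses it avoids.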
* Re: GNAT compiler switches and optimization
From: Robert A Duff @ 2006-10-20 15:38 UTC (permalink / raw)

"tkrauss" <thomas.krauss@gmail.com> writes:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. ...

In some cases, -O3 is slower than -O2.  You could try that experiment.

Is the Fortran compiler generating array-bounds checks?  If not,
pragma Suppress (All_Checks) in the Ada version will make the test
more fair.

- Bob
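Bob's suggestion can be sketched as follows.  This minimal program is
mine, not from the thread; it just shows where the pragma goes.  (As
Gautier notes downthread, the original poster's -gnatp switch already
has the same effect.)

```ada
--  Minimal sketch (illustrative, not from the thread): a source-level
--  equivalent of compiling with -gnatp.
procedure Tst_Nochecks is
   pragma Suppress (All_Checks);
   type Vec is array (1 .. 100) of Float;
   V : Vec := (others => 1.0);
   S : Float := 0.0;
begin
   for I in V'Range loop
      S := S + V (I);   --  compiled without index or overflow checks
   end loop;
end Tst_Nochecks;
```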
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 19:32 UTC (permalink / raw)

Robert A Duff:
> Is the Fortran compiler generating array-bounds checks?  If not,
> pragma Suppress(All_Checks) in the Ada version will make the test
> more fair.

Thomas mentioned the -gnatp option, which has the same effect as
"pragma Suppress (All_Checks)" in the source.

G.
* Re: GNAT compiler switches and optimization
From: Jeffrey Creem @ 2006-10-20 15:56 UTC (permalink / raw)

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

There have been a few useful comments (and quite a few not really
useful ones) but in the end, it seems pretty clear to me that in this
particular case GNAT sucks compared to the Fortran version.

I built the gcc "head" from gcc SVN with GNAT and Fortran to compare
the same versions (at least as much as possible).  I moved the start
timing calls after the array allocation and filling so we are just
timing the matrix multiplication, and I moved the stop timing calls to
make sure we were not timing I/O in either case (both original
versions were timing part of the "put").  I replaced the "random" data
with some fixed sane data just to be sure there was no funky
"denormal" stuff happening that changed the speed.

Very little change from the order of magnitude the original poster was
seeing (I pretty much get results with GNAT running about 2.6 times
slower), so it was time to look at the assembly.  I find it easier to
read assembly using SSE math, so I built the Ada version via

  gnatmake -g -f -gnatp -O3 -march=pentium4 -fomit-frame-pointer
    -mfpmath=sse tst_array

and the Fortran version via

  gfortran -O3 -g -march=pentium4 -fomit-frame-pointer -mfpmath=sse
    -c tst_array.f95

and then, using "objdump -D -S tst_array.o" to look at them, you
pretty quickly can see the problem.
The "inner loop" of the Fortran code looks like:

  2d0:  8d 04 19              lea    (%ecx,%ebx,1),%eax
  2d3:  f3 0f 10 02           movss  (%edx),%xmm0
  2d7:  f3 0f 59 44 85 04     mulss  0x4(%ebp,%eax,4),%xmm0
  2dd:  f3 0f 58 c8           addss  %xmm0,%xmm1
  2e1:  83 c1 01              add    $0x1,%ecx
  2e4:  01 f2                 add    %esi,%edx
  2e6:  39 f9                 cmp    %edi,%ecx
  2e8:  75 e6                 jne    2d0 <MAIN__+0x2d0>

The "inner loop" of the Ada code looks like:

  af2:  83 c6 01              add    $0x1,%esi
  af5:  89 f0                 mov    %esi,%eax
  af7:  2b 44 24 28           sub    0x28(%esp),%eax
  afb:  03 44 24 30           add    0x30(%esp),%eax
  aff:  8b 5c 24 38           mov    0x38(%esp),%ebx
  b03:  f3 0f 10 0c 83        movss  (%ebx,%eax,4),%xmm1
  b08:  8b 4d 00              mov    0x0(%ebp),%ecx
  b0b:  8b 45 0c              mov    0xc(%ebp),%eax
  b0e:  8b 55 08              mov    0x8(%ebp),%edx
  b11:  8b 5c 24 78           mov    0x78(%esp),%ebx
  b15:  29 d3                 sub    %edx,%ebx
  b17:  89 f7                 mov    %esi,%edi
  b19:  29 cf                 sub    %ecx,%edi
  b1b:  89 f9                 mov    %edi,%ecx
  b1d:  83 c0 01              add    $0x1,%eax
  b20:  29 d0                 sub    %edx,%eax
  b22:  01 c0                 add    %eax,%eax
  b24:  01 c0                 add    %eax,%eax
  b26:  ba 00 00 00 00        mov    $0x0,%edx
  b2b:  0f 48 c2              cmovs  %edx,%eax
  b2e:  0f af c8              imul   %eax,%ecx
  b31:  8d 1c 99              lea    (%ecx,%ebx,4),%ebx
  b34:  8b 44 24 3c           mov    0x3c(%esp),%eax
  b38:  f3 0f 10 04 03        movss  (%ebx,%eax,1),%xmm0
  b3d:  f3 0f 59 c1           mulss  %xmm1,%xmm0
  b41:  f3 0f 58 d0           addss  %xmm0,%xmm2
  b45:  3b 74 24 7c           cmp    0x7c(%esp),%esi
  b49:  75 a7                 jne    af2 <_ada_tst_array+0x254>

28 instructions vs. 8 for Fortran.  The GNAT version never stood a
chance.  It really seems like GNAT is dropping the ball here.
Granted, small benchmarks can really lead one to believe things are
better or worse than the truth, but I don't think there is really an
excuse in this case for this sort of performance.
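The extra Ada instructions are address arithmetic redone on every
iteration.  One hypothetical way to sidestep it (my sketch, not
something posted in the thread) is to flatten the matrices to
one-dimensional arrays and compute the row offsets by hand, leaving
the compiler a single stride-1 index; the names and fixed size below
are illustrative assumptions.

```ada
--  Hypothetical workaround for the repeated descriptor arithmetic:
--  flatten the matrices and do the row-major index math explicitly.
procedure Flat_Mult is
   N : constant := 4;
   type Flat is array (0 .. N * N - 1) of Float;
   A : constant Flat := (others => 1.0);
   B : constant Flat := (others => 1.0);
   C : Flat := (others => 0.0);
   Sum : Float;
begin
   for I in 0 .. N - 1 loop
      for J in 0 .. N - 1 loop
         Sum := 0.0;
         for R in 0 .. N - 1 loop
            --  row-major: A(I,R) = A(I*N + R), B(R,J) = B(R*N + J)
            Sum := Sum + A (I * N + R) * B (R * N + J);
         end loop;
         C (I * N + J) := Sum;
      end loop;
   end loop;
end Flat_Mult;
```

Whether this actually beats the two-dimensional form depends on the
compiler version; checking the -S output, as suggested upthread, is
the way to confirm.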
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 16:30 UTC (permalink / raw)

tkrauss wrote:
> Unfortunately the Ada code takes 3-4 times as long.

Well, I collected the various suggestions made and created three test
programs [1,2,3], each indeed faster than the last.

GNAT 4.1.1:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   18.615262000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    5.457141000

  ./Linux-x86_64-Release/Test_Array_3 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    4.509063000

GNAT GPL 2006 (20060522-34):

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   13.950607000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    8.010260000

  ./Linux-x86_64-Release/Test_Array_3 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    7.101502000

As options I used my trusted "Release" set, which you can see here
[4].

Martin

[1] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_1.adb?view=markup
[2] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_2.adb?view=markup
[3] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_3.adb?view=markup
[4] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/GNAT/wikibook_ada.gpr?view=markup

--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 19:51 UTC (permalink / raw)

Just a detail (it should not make much difference, but who knows with
Ada.Text_IO): the timer should be stopped just after the
multiplication and before any output.  This little bug is in both the
Fortran and the Ada code (in the Ada code the finish time is even
taken twice).

G.
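The timing discipline Gautier describes can be sketched in a small
self-contained program (mine, not from the thread; the dummy workload
is an arbitrary stand-in for the multiply): read the clock right
before and right after the work, and do all output afterwards.

```ada
--  Sketch of the timing fix: the clock is stopped once, immediately
--  after the measured work and before any Text_IO.
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;  use Ada.Text_IO;

procedure Tst_Timing is
   Start, Finish : Time;
   S : Long_Float := 0.0;
begin
   Start := Clock;
   for I in 1 .. 10_000_000 loop      --  stand-in for the real work
      S := S + Long_Float (I);
   end loop;
   Finish := Clock;                   --  before the Put calls below
   Put_Line ("Sum:  " & Long_Float'Image (S));
   Put_Line ("Time: " & Duration'Image (Finish - Start));
end Tst_Timing;
```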
* Re: GNAT compiler switches and optimization
From: Jeffrey R. Carter @ 2006-10-20 22:11 UTC (permalink / raw)

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

1st, make sure you're timing the same thing.  How does Ada matrix
multiplication compare to FORTRAN?  You don't know, because you're
also timing the random number generators and I/O.  Ada.Text_IO,
especially, is quite heavy-weight compared to other languages.

Here's my version of your program:

with Ada.Numerics.Float_Random;
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Calendar;     use Ada.Calendar;

procedure Tst_Array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   N : constant Positive := Integer'Value (Argument (1) );

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   pragma Convention (FORTRAN, Real_Matrix);

   G : Ada.Numerics.Float_Random.Generator;

   A, B : Real_Matrix :=
      (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
   C : Real_Matrix := (others => (others => 0.0) );

   Start, Finish : Ada.Calendar.Time;
begin
   Start := Ada.Calendar.Clock;

   for I in A'range (1) loop
      for J in A'range (2) loop
         for R in A'range (2) loop
            C (I, J) := C (I, J) + A (I, R) * B (R, J);
         end loop;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line;
   F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line;
   Put ("Time: "); D_IO.Put (Finish - Start); New_Line;
end Tst_Array;

I compiled it as

  gnatmake -O3 -gnatnp -fomit-frame-pointer tst_array

With MinGW GNAT 3.4.2, Windows XP, and a 3.2 GHz Pentium 4 HT
processor, I get:

  C:\Documents and Settings\All Users\Documents\Code>tst_array 800
   2.05839E+02 2.01230E+02
   2.00866E+02 1.94039E+02
  Time:  5.821082831

I don't have a FORTRAN compiler for comparison.  FORTRAN will probably
do better, but I wouldn't expect the difference to be as great as your
values.

--
Jeff Carter
"I was hobbling along, minding my own business, all of a sudden, up he
comes, cures me!  One minute I'm a leper with a trade, next minute my
livelihood's gone!  Not so much as a 'by your leave!'  You're cured,
mate.  Bloody do-gooder!"
Monty Python's Life of Brian
* Re: GNAT compiler switches and optimization 2006-10-20 22:11 ` Jeffrey R. Carter @ 2006-10-20 23:52 ` Jeffrey Creem 2006-10-21 7:37 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-20 23:52 UTC (permalink / raw) Jeffrey R. Carter wrote: > tkrauss wrote: > >> I'm a bit stuck trying to figure out how to coax more performance >> out of some Ada code. I suspect there is something simple (like >> compiler switches) but I'm missing it. As an example I'm using >> a simple matrix multiply and comparing it to similar code in >> Fortran. Unfortunately the Ada code takes 3-4 times as long. > > > 1st, make sure you're timing the same thing. How does Ada matrix > multiplication compare to FORTRAN? You don't know, because you're also > timing the random number generators and I/O. Ada.Text_IO, especially, is > quite heavy-weight compared to other languages. > > Here's my version of your program: > Note, I am the first one to jump to the defense of "Ada" in general, but in this case GNAT just plain sucks compared to GNU FORTRAN, as it does a poor job on (at least) the inner loop (verified by looking at the output assembly). Jeff's (the other Jeff :) modified version looks a little cleaner and actually runs faster (than even the old "fixed version" that did not time the IO and made sure to just time the matrix multiply in both versions), but it does not time the zeroing of the elements of C, which would be required if this were a real matrix multiply routine and not some test driver. However, even having said that, this not-really-equivalent version runs about 2x slower than the FORTRAN (with the same version of GCC). I don't see any meaningful excuse for why GNAT should be slower in this case, but it clearly is. I tried looking at the tree dump generated by the front ends prior to going to the optimizer step (not something I have a lot of experience at).
One thing that is clear is that the trees generated by GNAT are quite a bit uglier and more verbose, so it is not surprising that the optimizer is unable to fully clean things up, resulting in the explosion of instructions at the assembly level. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-20 23:52 ` Jeffrey Creem @ 2006-10-21 7:37 ` Gautier 2006-10-21 16:35 ` Jeffrey Creem 2006-10-23 6:28 ` Martin Krischik 0 siblings, 2 replies; 68+ messages in thread From: Gautier @ 2006-10-21 7:37 UTC (permalink / raw) Jeffrey Creem: > Note, I am the first one to jump to the defense of "Ada" in general but > in this case, GNAT just plain sucks compared to GNU FORTRAN as it does a > poor job on (at least) the inner loop (verified by looking at the output > assembly) There is something strange... Martin Krischik was able to trim the overall time for the Ada code down to 24% of the first version (GNAT/GCC 4.1.1). This should make the Ada program as fast as the FORTRAN one, shouldn't it ? Maybe it's because the test is done on a 64 bit machine ? It needs some reconciliation... A good thing in that discussion would be that everybody shows each time - which GCC version - which machine - the execution time of the multiplication for both Ada and Fortran - which version of the Ada code (matrix on stack/heap, Fortran or Ada convention) Cheers, Gautier ______________________________________________________________ Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm NB: For a direct answer, e-mail address on the Web site! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 7:37 ` Gautier @ 2006-10-21 16:35 ` Jeffrey Creem 2006-10-21 17:04 ` Pascal Obry 2006-10-21 18:28 ` tmoran 2006-10-23 6:28 ` Martin Krischik 1 sibling, 2 replies; 68+ messages in thread From: Jeffrey Creem @ 2006-10-21 16:35 UTC (permalink / raw) Gautier wrote: > Jeffrey Creem: > >> Note, I am the first one to jump to the defense of "Ada" in general >> but in this case, GNAT just plain sucks compared to GNU FORTRAN as it >> does a poor job on (at least) the inner loop (verified by looking at >> the output assembly) > > > There is something strange... Martin Krischik was able to trim the > overall time for the Ada code down to 24% of the first version (GNAT/GCC > 4.1.1). > This should make the Ada program as fast as the FORTRAN one, shouldn't it ? > Maybe it's because the test is done on a 64 bit machine ? > It needs some reconciliation... > A good thing in that discussion would be that everybody shows each time > - which GCC version > - which machine > - the execution time of the multiplication for both Ada and Fortran > - which version of the Ada code (matrix on stack/heap, Fortran or Ada > convention) > > Cheers, Gautier > ______________________________________________________________ > Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm > > NB: For a direct answer, e-mail address on the Web site! I'd certainly be willing to run a few benchmarks, but the important thing here is that rather innocent-looking code is running 2-4x slower than it "should". There are things that I think we can really rule out as being "the" factor. 1) Random number generator - I did timings (for both the Ada and FORTRAN) with timing moved to only cover the matrix multiply. 2) Different GCC versions - I built a fresh GCC from the GCC trunk for both Ada and FORTRAN 3) The Machine - I am running both on the same machine, though I suppose there could be differences in 32 bit v.s. 64 bit comparisons.
4) Runtime checks - both the original author (and I) ran with checks suppressed 5) O2/O3 - Actually, I could look at this some more with some other versions, but a quick look when I first started seemed to indicate this was not the issue. A few other thoughts. Once the timing is limited to just the matrix multiply, the stack/heap thing 'should' generally not matter. Some of the changes made to the Ada version make it not really the same program as the FORTRAN version, and the same changes made to the FORTRAN one would also cause it to speed up (e.g. not counting the zeroing of the target array during the accumulation phase). I have certainly seen some amazing performance from some Ada compilers in the past, and in general, on non-trivial benchmarks, I am usually pretty happy with the output of GNAT as well, but in this case it is not great. Further, I tried playing a bit with the new autovectorization capability of the new 4.X series of GCC (it has to be specifically enabled) and found that even very, very trivial cases would refuse to vectorize under Ada (though after I submitted the bug report to GCC, I found that FORTRAN fails to vectorize these too). One thing everyone needs to remember is that this example was (probably) not "Find the way to get the smallest value out of this test program", because there are always ways of doing some tweaks to a small enough region of code to make it better. If there is a 2-4x global slowdown in your 100KLOC program, you will never "get there" following the conventional wisdom of profiling and looking for the problems.
Now, I am not suggesting that GNAT is globally 2-4x slower than GFORTRAN or anything like that (since that does not line up with what I have generally seen on larger code bases), but if I were a manager picking a new language based on a set of long term goals for a project and saw that GNAT was running 2-4x slower, and was still running 1.X to 3X slower after 2 days of Ada gurus looking at it, I'd probably jettison Ada (I know, I am mixing compilers and languages here, but in reality, that is what happens in the real world) and go with something else. And before the chorus of the "processors are so fast that performance does not matter as much as safety and correctness" crowd starts getting too loud, let me point out that there are still many segments of the industry where performance does still indeed matter. Especially when one is trading adding a second processor to an embedded box against a vague promise of "betterness" in terms of safety down the road... OK, off the soapbox. So, in closing, if someone thinks they have "the best" version of that program they want timed against gfortran, post it here and I'll run them. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 16:35 ` Jeffrey Creem @ 2006-10-21 17:04 ` Pascal Obry 2006-10-21 21:22 ` Jeffrey Creem 2006-10-21 18:28 ` tmoran 1 sibling, 1 reply; 68+ messages in thread From: Pascal Obry @ 2006-10-21 17:04 UTC (permalink / raw) To: Jeffrey Creem Jeffrey Creem a écrit : > I'd certainly be willing to run a few benchmarks but the important thing > here is that rather innocent looking code is running 2-4x slower than it > "should". The first thing is to be sure that we are running the same program. Running your program with the following changes (as done in Fortran): 1. Using Sum tmp var for the computation for I in A'range (1) loop for J in A'range (2) loop Sum := 0.0; for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J) := Sum; end loop; end loop; 2. Using Long_Float instead of Float (I think Fortran float is a Long_Float, to be checked). I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). Pascal. -- --|------------------------------------------------------ --| Pascal Obry Team-Ada Member --| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE --|------------------------------------------------------ --| http://www.obry.net --| "The best way to travel is by means of imagination" --| --| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595 ^ permalink raw reply [flat|nested] 68+ messages in thread
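The effect of Pascal's first change can be sketched outside Ada as well. The following C fragment is an illustrative translation (not code from the thread; the function names and the small size N are invented for the example): with the local accumulator, the compiler can keep the running sum in a register instead of re-reading and re-writing the destination element on every inner-loop iteration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define N 4  /* toy size for illustration; the thread benchmarks 800x800 */

/* Inner loop updates c[i][j] through memory on every iteration. */
static void matmul_in_place(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            c[i][j] = 0.0f;
            for (size_t r = 0; r < N; r++)
                c[i][j] = c[i][j] + a[i][r] * b[r][j]; /* load + store each step */
        }
}

/* Same computation with the "Sum tmp var": one store per element. */
static void matmul_with_sum(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;                 /* accumulator, register-friendly */
            for (size_t r = 0; r < N; r++)
                sum = sum + a[i][r] * b[r][j];
            c[i][j] = sum;
        }
}
```

Both forms compute the same products in the same order, so they agree numerically; the difference is purely in how much memory traffic the optimizer has to eliminate, which matches what the timings in this subthread suggest.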
* Re: GNAT compiler switches and optimization 2006-10-21 17:04 ` Pascal Obry @ 2006-10-21 21:22 ` Jeffrey Creem 2006-10-22 3:03 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-21 21:22 UTC (permalink / raw) Pascal Obry wrote: > Jeffrey Creem a écrit : > > >>I'd certainly be willing to run a few benchmarks but the important thing >>here is that rather innocent looking code is running 2-4x slower than it >> "should". > > > The first thing is to be sure that we are running the same program. > > Running your program with the following changes (as done in Fortran): > > 1. Using Sum tmp var for the computation > > for I in A'range (1) loop > for J in A'range (2) loop > Sum := 0.0; > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J) := Sum; > end loop; > end loop; > > 2. Using Long_Float instead of Float (I think Fortran float is a > Long_Float, to be checked). > > I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). > > Pascal. > Actually, the original poster's Ada program had the temp var, and all of my comparisons of programs that I have asserted were "the same" used the temporary. As for Long_Float v.s. Short_Float, gfortran is using 32 bit floats (as verified by dumping the tree representation and assembly language). Since we are all getting a bit confused by specific versions and numbers, I thought I'd post a summary and create a way of tracking more global performance issues at the gnuada.sf.net wiki. The direct link to these test results is here: http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 21:22 ` Jeffrey Creem @ 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 3:03 UTC (permalink / raw) Jeffrey Creem wrote: > Pascal Obry wrote: > >> Jeffrey Creem a écrit : >> >> >>> I'd certainly be willing to run a few benchmarks but the important thing >>> here is that rather innocent looking code is running 2-4x slower than it >>> "should". >> >> >> >> The first thing is to be sure that we are running the same program. >> >> Running your program with the following changes (as done in Fortran): >> >> 1. Using Sum tmp var for the computation >> >> for I in A'range (1) loop >> for J in A'range (2) loop >> Sum := 0.0; >> for R in A'range (2) loop >> Sum := Sum + A (I, R) * B (R, J); >> end loop; >> C (I, J) := Sum; >> end loop; >> end loop; >> >> 2. Using Long_Float instead of Float (I think Fortran float is a >> Long_Float, to be checked). >> >> I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). >> >> Pascal. >> > > Actually, the original poster's Ada program had the temp var and all of > my comparisons of programs that I have asserted were "the same" used the > temporary. > > As for Long_Float v.s. Short_Float, gfortran is using 32 bit floats (as > verified by dumping the tree representation and assembly language). > > Since we are all getting a bit confused by specific versions and numbers > I thought I'd post a summary and create a way of tracking more global > performance issues at the gnuada.sf.net wiki. > > The direct link to these test results are here: > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > > Actually, as a result of this, I submitted a bug report to the GCC bugzilla list. You can follow progress on it here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 Interesting initial feedback is that 1) Not an Ada bug.
2) Is a FORTRAN bug 3) Is a backend limitation of the optimizer. Of course, the FORTRAN one still runs correctly so I don't think most users will care that it is because of a bug :) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem @ 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss ` (2 more replies) 2006-10-22 13:50 ` Alinabi 2006-10-22 15:57 ` Jeffrey Creem 2 siblings, 3 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 7:39 UTC (permalink / raw) Jeffrey Creem wrote: > > Actually, as a result of this, I submitted a bug report to the GCC > bugzilla list. You can follow progress on it here: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 > > Interesting initial feedback is that > 1) Not an Ada bug. > 2) Is a FORTRAN bug > 3) Is a backend limitation of the optimizer. > > Of course, the FORTRAN one still runs correctly so I don't think most > users will care that it is because of a bug :) Interesting. I've been experimenting with some variations simply out of curiosity and found some things that seem a bit strange. (All results for an argument of 800.) Adding the Sum variable makes an important difference, as others have reported, in my case from 5.82 to 4.38 s. Hoisting the indexing calculation for the result (C) matrix location is a basic optimization, and I would be surprised if it isn't done. The only thing I can think of is that it's a cache issue: that all 3 matrices can't be kept in cache at once. Perhaps compiler writers would be able to make sense of this. Previously, I found no difference between -O2 and -O3. With this change, -O2 is faster. The issue of using 'range compared to using "1 .. N" makes no difference in my version of the program. Something I found really surprising is that putting the multiplication in a procedure makes the program faster, down to 4.03 s. I have no idea why this would be so. Compiled with MinGW GNAT 3.4.2, -O2, -gnatnp -fomit-frame-pointer. Run under Windows XP SP2 on a 3.2 GHz Pentium 4 HT with 1 GB RAM. 
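The cache speculation above is the classic motivation for blocked (tiled) matrix multiplication. The following C fragment is an illustrative sketch of that technique, not code from the thread; the sizes N and BS are invented for the example, and a real BS would be tuned so three BS-by-BS tiles fit in cache together.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define N  8   /* toy size; this sketch assumes N is a multiple of BS */
#define BS 4   /* tile size, normally tuned to the cache */

/* Blocked multiply: each innermost pass touches only one tile of each
   matrix, so the working set stays cache-resident even when the full
   matrices do not. */
static void matmul_blocked(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0f;

    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t rr = 0; rr < N; rr += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t j = jj; j < jj + BS; j++) {
                        float sum = c[i][j];
                        for (size_t r = rr; r < rr + BS; r++)
                            sum += a[i][r] * b[r][j];
                        c[i][j] = sum;
                    }
}

/* Plain triple loop, kept only to check the blocked version against. */
static void matmul_naive(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t r = 0; r < N; r++)
                sum += a[i][r] * b[r][j];
            c[i][j] = sum;
        }
}
```

This is presumably part of what Fortran's matmul intrinsic does internally, which would help explain the 2.5 s figure in the original post.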
Here's the code: with Ada.Numerics.Float_Random; with Ada.Command_Line; use Ada.Command_Line; with Ada.Text_IO; use Ada.Text_IO; with Ada.Calendar; use Ada.Calendar; procedure Tst_Array is package F_IO is new Ada.Text_IO.Float_IO (Float); package D_IO is new Ada.Text_Io.Fixed_Io (Duration); N : constant Positive := Integer'Value (Argument (1) ); type Real_Matrix is array (1 .. N, 1 .. N) of Float; pragma Convention (FORTRAN, Real_Matrix); G : Ada.Numerics.Float_Random.Generator; A,B : Real_Matrix := (others => (others => Ada.Numerics.Float_Random.Random (G) ) ); C : Real_Matrix := (others => (others => 0.0) ); Start, Finish : Ada.Calendar.Time; procedure Multiply is Sum : Float; begin -- Multiply All_Rows : for Row in A'range (1) loop All_Columns : for Column in B'range (2) loop Sum := 0.0; All_Common : for R in A'range (2) loop Sum := Sum + A (Row, R) * B (R, Column); end loop All_Common; C (Row, Column) := Sum; end loop All_Columns; end loop All_Rows; end Multiply; begin Start := Ada.Calendar.Clock; Multiply; Finish := Ada.Calendar.Clock; F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line; F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line; Put ("Time: "); D_IO.Put (Finish - Start); New_Line; end Tst_Array; Next, since meaningful speed-ups of quick sort on a Pentium 4 HT processor using 2 tasks have been reported, I thought I'd see what effect that had. With 2 tasks, I got a time of 3.70 s. That's not a significant speed up: about 9.1%. Same compilation options and platform. Here's that code: with Ada.Numerics.Float_Random; with Ada.Command_Line; use Ada.Command_Line; with Ada.Text_IO; use Ada.Text_IO; with Ada.Calendar; use Ada.Calendar; procedure Tst_Array is package F_IO is new Ada.Text_IO.Float_IO (Float); package D_IO is new Ada.Text_Io.Fixed_Io (Duration); N : constant Positive := Integer'Value (Argument (1) ); type Real_Matrix is array (1 .. N) of Float; pragma Convention (FORTRAN, Real_Matrix); G : Ada.Numerics.Float_Random.Generator; A, B : Real_Matrix := (others => (others => Ada.Numerics.Float_Random.Random (G) ) ); C : Real_Matrix := (others => (others => 0.0) ); Start, Finish : Ada.Calendar.Time; procedure Multiply is procedure Multiply (Start_Row : in Positive; Stop_Row : in Positive) is Sum : Float; begin -- Multiply All_Rows : for Row in Start_Row .. Stop_Row loop All_Columns : for Column in B'range (2) loop Sum := 0.0; All_Common : for R in A'range (2) loop Sum := Sum + A (Row, R) * B (R, Column); end loop All_Common; C (Row, Column) := Sum; end loop All_Columns; end loop All_Rows; end Multiply; task type Multiplier (Start_Row : Positive; Stop_Row : Positive); task body Multiplier is -- null; begin -- Multiplier Multiply (Start_Row => Start_Row, Stop_Row => Stop_Row); end Multiplier; Stop : constant Positive := N / 2; Start : constant Positive := Stop + 1; Mult : Multiplier (Start_Row => 1, Stop_Row => Stop); begin -- Multiply Multiply (Start_Row => Start, Stop_Row => N); end Multiply; begin Start := Ada.Calendar.Clock; Multiply; Finish := Ada.Calendar.Clock; F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line; F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line; Put ("Time: "); D_IO.Put (Finish - Start); New_Line; end Tst_Array; If I inline the inner Multiply, or put equivalent code in the task and the outer Multiply, the time is much more than for the sequential version, presumably due to cache effects. Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I would be interested in seeing what effect this concurrent version has on that platform. I also wonder how easy such a version would be to create in FORTRAN. -- Jeff Carter "Ada has made you lazy and careless. You can write programs in C that are just as safe by the simple application of super-human diligence." E. Robert Tisdale 72 ^ permalink raw reply [flat|nested] 68+ messages in thread
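For comparison, the row-splitting idea in the two-task Ada version above can be sketched with POSIX threads. This is an illustrative C translation, not code from the thread; names and the toy size are invented. As in the Ada version, the two workers write disjoint bands of C, so no synchronization beyond the final join is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N 8  /* toy size for illustration */

static float a[N][N], b[N][N], c[N][N];

struct row_span { size_t lo, hi; };  /* half-open row range [lo, hi) */

/* Worker: compute one band of rows of C. Bands are disjoint, so the
   threads never touch the same element and need no locking. */
static void *multiply_rows(void *arg)
{
    const struct row_span *s = arg;
    for (size_t i = s->lo; i < s->hi; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t r = 0; r < N; r++)
                sum += a[i][r] * b[r][j];
            c[i][j] = sum;
        }
    return NULL;
}

/* One extra thread takes the top half while the calling thread does the
   bottom half, mirroring the single Multiplier task plus the direct call
   to Multiply in the Ada version. */
static void parallel_multiply(void)
{
    struct row_span top = { 0, N / 2 }, bottom = { N / 2, N };
    pthread_t worker;
    pthread_create(&worker, NULL, multiply_rows, &top);
    multiply_rows(&bottom);
    pthread_join(worker, NULL);
}
```

The modest ~9% gain reported above is consistent with hyper-threading sharing one physical core's execution units; on two physical processors the same split should do considerably better.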
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter @ 2006-10-22 11:48 ` tkrauss 2006-10-22 18:02 ` Georg Bauhaus 2006-10-22 20:20 ` Jeffrey R. Carter 2006-10-22 12:31 ` Gautier 2006-10-22 18:01 ` tmoran 2 siblings, 2 replies; 68+ messages in thread From: tkrauss @ 2006-10-22 11:48 UTC (permalink / raw) > If I inline the inner Multiply, or put equivalent code in the task and > the outer Multiply, the time is much more than for the sequential > version, presumably due to cache effects. > > Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I > would be interested in seeing what effect this concurrent version has on > that platform. I also wonder how easy such a version would be to create > in FORTRAN. It's funny that you mention tasking since that's what got me started on this in the first place. I was introduced to OpenMP as a simple method of parallelizing code in fortran. Unfortunately it seems to be "tricky" to really get things in parallel (what data is shared, multiple tasks doing the same thing with the same memory, access violations, etc.) I remembered that Ada had tasking built in so I started playing with that. Now, as you can probably tell from my code, I haven't touched Ada in a _very_ long time, but it was surprisingly easy to set up a simple two-task test program. Anyway, using the latest code from Jeffrey Creem it looks like the execution time (on my machine) has been cut in half (9 seconds). The threaded version runs in nearly the same time for smaller problems but dies with a stack overflow for larger. I see a comment in the Bugzilla recommending a similar construct type Real_Matrix is array (1 .. N, 1 .. N) of Float; That takes the memory from the stack rather than the heap though, no? I assume there is a compiler switch to increase the stack size so the code wouldn't die, but is that the "normal" way of allocating memory?
I'm trying to not look like _too_ much of an Ada neophyte :) ^ permalink raw reply [flat|nested] 68+ messages in thread
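The stack-versus-heap distinction being asked about here is not Ada-specific. The following C fragment is an illustrative sketch (invented names, not code from the thread): a large local array lives on the thread's limited stack and can overflow it silently, while an explicit heap allocation is bounded only by available memory and can report failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* Heap version: one contiguous block, indexed as m[i * n + j].
   Returns NULL on failure instead of crashing. */
static float *make_matrix(size_t n)
{
    return malloc(n * n * sizeof(float));
}

/* The stack analogue would be a variable-length local array:
       float local[n][n];
   For n = 800 that is ~2.5 MB per matrix, which can already exceed a
   small thread stack -- the same failure mode as the Ada version that
   declares Real_Matrix objects directly inside a task. */
```

The usual choices are exactly the ones raised in the thread: raise the (OS- or binder-controlled) stack limit, or allocate large objects from the heap and free them explicitly.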
* Re: GNAT compiler switches and optimization 2006-10-22 11:48 ` tkrauss @ 2006-10-22 18:02 ` Georg Bauhaus 2006-10-22 18:24 ` Jeffrey Creem 2006-10-22 20:20 ` Jeffrey R. Carter 1 sibling, 1 reply; 68+ messages in thread From: Georg Bauhaus @ 2006-10-22 18:02 UTC (permalink / raw) tkrauss wrote: > That takes the memory from the stack rather than the heap though, no? > I assume there is a compiler switch to increase the stack size so the > code wouldn't die, but is that the "normal" way of allocating > memory? I'm trying to not look like _too_ much of an Ada neophyte :) > I just tested a similar thing using two different compilers: a function that uses a local array of a size determined by a function parameter. The result was that - GNAT may trigger constraint errors when the array size exceeds some limit. The documentation tells the compiler user that this is an operating system issue and should be addressed at this level IIUC. (Does this comment apply in this case?) - ObjectAda had no problem with local arrays of the same tens of millions of components. I've measured performance by calling the function many times with different Size parameters for the array. This has consistently shown that on this platform, GNAT code takes at least twice as much time as OA code. (The time is not affected by -O?.) Using an allocator "new Big_Array(1 .. Size)" and an instance of Unchecked_Deallocation in place of simple variable allocation has two effects. - The execution time of ObjectAda code almost does not change. This lets me suspect that OA isn't necessarily using the primary stack for local variables. Indeed, the assembly listing shows calls on rts_ss_allocate (for the local array variable) versus rts_allocate (for an array allocator). The RTS function rts_ss_allocate is the link_name of the function System.RTS.TGT.Sec_Stack_Pkg.Allocate from the ObjectAda RTS. - The "GNAT heap version" performs about 2-3 times faster than the "GNAT stack version".
(GNAT code that uses "new" instead of a simple local variable performs marginally to slightly faster than the OA code using "new".) So there is a major difference in execution speed of programs that compute larger matrices if you use GNAT and simple array variables, but not allocators. Execution speed does not differ significantly using at least one other compiler (ObjectAda). I think that's a pity when seen from a "just Ada" perspective. There is one more thing (allocators) you have to think about if you want speed for local array variables when using GNAT. Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle and last element of the array are referenced. Pragma volatile has been tried, too. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:02 ` Georg Bauhaus @ 2006-10-22 18:24 ` Jeffrey Creem 2006-10-23 0:10 ` Georg Bauhaus 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 18:24 UTC (permalink / raw) Georg Bauhaus wrote: > > I think that's a pity when seen from a "just Ada" perspective. There > is one more thing (allocators) you have to think about if you > want speed for local array variables when using GNAT. > > > Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle and last > element of the array are referenced. Pragma volatile > has been tried, too. It sounds like you are running some different code, and I'd be hesitant to make any assertions about runtime for certain constructs without seeing it since 1) We just spent 3 days talking about the code that started this thread and there have been all sorts of assertions about things being faster/slower, etc. 2) You mention something about just accessing first, middle and last of your arrays, so it really sounds like you are just timing allocations and not actually hitting the array indexing (though hard to say without seeing the code). Don't take either of those as any sort of severe criticism, but I am just trying to make sure that any discussions of performance are tied to clear, open and repeatable measurements so we all know what each other are talking about. The problem is that too often, performance criticism is done via vague references rather than hard facts, and thus they get dismissed (sometimes) by some compiler writers. The bug submitted on this issue (whether or not it ever gets fixed) certainly has generated real and useful discussions among the developers with few of the typical 'silly user' responses that we sometimes see. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:24 ` Jeffrey Creem @ 2006-10-23 0:10 ` Georg Bauhaus 0 siblings, 0 replies; 68+ messages in thread From: Georg Bauhaus @ 2006-10-23 0:10 UTC (permalink / raw) On Sun, 2006-10-22 at 14:24 -0400, Jeffrey Creem wrote: > It sounds like you are running some different code and I'd be hesitant > to make any assertions about runtime for certain constructs without > seeing it since It is appended below. > 2) You mention something about just accessing first, middle and last of > your arrays so it really sounds like you are just timing > allocations and not actually hitting the array indexing > (though hard to say without seeing the code). Yes, I had concentrated on just this question, asked by Thomas Krauss, about stack versus heap allocation. If a program is not frequently allocating and freeing arrays (or matrices, etc.), then the effects might be less of an issue, if they are an issue at all when there is no initialization. But I imagined allocation is just what is happening all the time if you have a function that computes matrices of varying sizes? OTOH, I find very different results on GNU/Linux (for which I only have a number of GNATs). Stack allocation appears to be consuming next to no time. (Commit on write? Just remembering an offset?) Below the following program, there is a patch that adds trivial initialization loops. With them, a comparison of stack vs heap on GNU/Linux seems in favor of the stack. I can't check this on Windows right now, though.
with Ada.Calendar; with Ada.Text_IO; with Ada.Unchecked_Deallocation; procedure main is use Ada.Calendar, Ada; type LIST is array (NATURAL range <>) of BOOLEAN; for LIST'component_size use 8; -- for a "normal" array, not packed or other magic type LIST_PTR is access LIST; procedure free is new Ada.Unchecked_Deallocation (LIST, LIST_PTR); accu: BOOLEAN := false; -- The allocating functions read and write this variable -- using components of the local arrays. -- (This should prevent some optimizations.) function allocate(size: POSITIVE) return BOOLEAN; -- use a local `LIST` of length `size` function allocate_heap(size: POSITIVE) return BOOLEAN; -- use a pointer to a new `LIST` of length `size` function image(t: TIME) return STRING; -- the current time as a `STRING` value function allocate(size: POSITIVE) return BOOLEAN is done: LIST(1 .. size); -- pragma volatile(done); result: BOOLEAN; begin if done(size / 2) then result := false; end if; done(done'last) := accu and done(done'first); result := done(done'last) and done(done'first); return result; end allocate; function allocate_heap(size: POSITIVE) return BOOLEAN is done: LIST_PTR; result: BOOLEAN; begin done := new LIST(1 .. size); if done(size / 2) then result := false; end if; done(done'last) := accu and done(done'first); result := done(done'first) and done(done'first); Free(done); return result; end allocate_heap; function image(t: TIME) return STRING is year: YEAR_NUMBER; day: DAY_NUMBER; month: MONTH_NUMBER; sec: DURATION; begin split(t, year, month, day, sec); return YEAR_NUMBER'image(year) & MONTH_NUMBER'image(month) & DAY_NUMBER'image(day) & DURATION'image(sec); end image; start, finish: TIME; begin Text_IO.put_line("using a stack"); start := clock; for run in 1 .. 25 loop for k in 1 .. 
2000 loop accu := allocate(5000 * k); end loop; end loop; finish := clock; Text_IO.put_line("from" & image(start) & " to" & image(finish) & " = " & DURATION'image(finish - start)); Text_IO.put_line("accu " & BOOLEAN'image(accu)); Text_IO.put_line("using a heap"); start := clock; for run in 1 .. 25 loop for k in 1 .. 2000 loop accu := allocate_heap(5000 * k); end loop; end loop; finish := clock; Text_IO.put_line("from" & image(start) & " to" & image(finish) & " = " & DURATION'image(finish - start)); Text_IO.put_line("accu " & BOOLEAN'image(accu)); end main; --- stack_use_testing.ada 2006/10/22 23:51:48 1.1 +++ stack_use_testing.ada 2006/10/22 23:52:12 @@ -33,2 +33,3 @@ begin + for k in done'range loop done(k) := false; end loop; if done(size / 2) then @@ -46,2 +47,3 @@ done := new LIST(1 .. size); + for k in done'range loop done(k) := false; end loop; if done(size / 2) then @@ -75,4 +77,4 @@ start := clock; - for run in 1 .. 25 loop - for k in 1 .. 2000 loop + for run in 1 .. 2 loop + for k in 1 .. 200 loop accu := allocate(5000 * k); @@ -90,4 +92,4 @@ start := clock; - for run in 1 .. 25 loop - for k in 1 .. 2000 loop + for run in 1 .. 2 loop + for k in 1 .. 200 loop accu := allocate_heap(5000 * k); ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 11:48 ` tkrauss 2006-10-22 18:02 ` Georg Bauhaus @ 2006-10-22 20:20 ` Jeffrey R. Carter 1 sibling, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:20 UTC (permalink / raw) tkrauss wrote: > > Anyway, using the latest code from Jeffrey Creem it looks like the > execution time (on my machine) has been cut in half (9 seconds) . The > threaded version runs in nearly the same time for smaller problems but > dies with a stack overflow for larger. I see a comment in the Bugzilla > recommending a similar construct > type Real_Matrix is array (1 .. N, 1 .. N) of > Float; > That takes the memory from the stack rather than the heap though, no? > I assume there is a compiler switch to increase the stack size so the > code wouldn't die, but is that the "normal" way of allocating > memory? I'm trying to not look like _too_ much of an Ada neophyte :) With GNAT, at least, they're allocated on the stack. AFAICT, the language doesn't require that, but it's a common approach. The main stack size is probably an OS thing. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss @ 2006-10-22 12:31 ` Gautier 2006-10-22 20:26 ` Jeffrey R. Carter 2006-10-22 18:01 ` tmoran 2 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-22 12:31 UTC (permalink / raw) Jeffrey R. Carter: > Adding the Sum variable makes an important difference, as others have > reported, in my case from 5.82 to 4.38 s. Hoisting the indexing > calculation for the result (C) matrix location is a basic optimization, > and I would be surprised if it isn't done. The only thing I can think of > is that it's a cache issue: that all 3 matrices can't be kept in cache > at once. Perhaps compiler writers would be able to make sense of this. The Sum variable was *removed* by someone at some point of the discussion in order to challenge a bit more the Ada compiler's optimizer. If you replace a good algorithm by a bad one, don't be surprised that the program is slow. At some point of bad coding the best code optimizer won't be able to help you. Eventually the optimizer will transform this: for R in A'range (2) loop C (I, J) := C (I, J) + A (I, R) * B (R, J); end loop; into something like: Sum:= C (I, J); -- hopefully a Sum is mapped to a register for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J):= Sum; but in that case it probably won't be able to guess that the C(I,J) was zeroed before and replace the first line by: Sum:= 0.0; sparing the reading of C(I,J) (must cost much time...). If you are luckier, the optimizer will do Sum:= 0.0; for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J):= C (I, J) + Sum; But still, it won't spare the time lost to fill the C matrix with zeros. If you want to do a benchmark with Fortran, it's really not a good idea to begin with "pessimizing" the Ada code. Cheers Gautier ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 12:31 ` Gautier @ 2006-10-22 20:26 ` Jeffrey R. Carter 2006-10-22 21:22 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:26 UTC (permalink / raw) Gautier wrote: > > The Sum variable was *removed* by someone at some point of the > discussion in order to challenge a bit more the Ada compiler's optimizer. > If you replace a good algorithm by a bad one, don't be surprised that > the program is slow. At some point of bad coding the best code optimizer > won't be able to help you. I did that in my 1st version. I wanted to see if the optimizer would result in equivalent code. No such luck. > Eventually the optimizer will transform this: > for R in A'range (2) loop > C (I, J) := C (I, J) + A (I, R) * B (R, J); > end loop; > into something like: > Sum:= C (I, J); -- hopefully a Sum is mapped to a register > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J):= Sum; > but in that case it probably won't be able to guess that the C(I,J) was > zeroed before and replace the first line by: > Sum:= 0.0; Thanks for the discussion. The initialization of C is static, so a good optimizer could. They're hard to find, though. > sparing the reading of C(I,J) (must cost much time...). > If you are luckier, the optimizer will do > Sum:= 0.0; > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J):= C (I, J) + Sum; > But still, it won't spare the time lost to fill the C matrix with zeros. I didn't include that in the timing. > If you want to do a benchmark with Fortran, it's really not a good idea > to begin with "pessimizing" the Ada code. I'm more interested in seeing what makes a difference in the Ada. In this case, the high-level features that let you write less code. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:26 ` Jeffrey R. Carter @ 2006-10-22 21:22 ` Gautier 0 siblings, 0 replies; 68+ messages in thread From: Gautier @ 2006-10-22 21:22 UTC (permalink / raw) Jeffrey R. Carter: >> But still, it won't spare the time lost to fill the C matrix with zeros. > > I didn't include that in the timing. But you have to! Otherwise you are cheating. The zeroing is a part of your changed algorithm; without zeroing the result is wrong. >> If you want to do a benchmark with Fortran, it's really not a good >> idea to begin with "pessimizing" the Ada code. > > I'm more interested in seeing what makes a difference in the Ada. In > this case, the high-level features that let you write less code. It is less code... "on the paper". In fact, it is more work, or, at best, if the code optimizer is really smart enough to combine the ":= (others => (others => 0.0) );" with the loops, it is the same work. Maybe that's the problem with high-level features: one can zero a matrix in so few characters that it seems to be done instantly ;-). Cheers Gautier ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss 2006-10-22 12:31 ` Gautier @ 2006-10-22 18:01 ` tmoran 2006-10-22 20:54 ` Jeffrey R. Carter 2 siblings, 1 reply; 68+ messages in thread From: tmoran @ 2006-10-22 18:01 UTC (permalink / raw) > ... the time is much more than for the sequential > version, presumably due to cache effects. Each 800x800 Float matrix is about 2.5 megabytes so I would expect two threads to be fighting each other over the cache. What are the single vs dual threaded times for, say, 250x250 arrays and a 1 MB cache on your machine, and on the dual Xeon machine? Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling with -O2 -gnatp using the venerable gnat 3.15p, I get Time: 8.511855190 800 1 thread Time: 5.682686332 800 2 threads for a speedup of 33% Time: 0.092724434 200 1 thread Time: 0.059209520 200 2 threads for a speedup of 36% ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:01 ` tmoran @ 2006-10-22 20:54 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:54 UTC (permalink / raw) tmoran@acm.org wrote: > Each 800x800 Float matrix is about 2.5 megabytes so I would expect two > threads to be fighting each other over the cache. What are the single vs > dual threaded times for, say, 250x250 arrays and a 1 MB cache on your > machine, and on the dual Xeon machine? > Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling > with -O2 -gnatp using the venerable gnat 3.15p, I get > Time: 8.511855190 800 1 thread > Time: 5.682686332 800 2 threads > for a speedup of 33% > Time: 0.092724434 200 1 thread > Time: 0.059209520 200 2 threads > for a speedup of 36% On my 3.2 GHz P4 HT (best of 3 runs), I get Time: 0.170563050 250 1 thread Time: 0.134637140 250 2 threads for a speedup of 24%. Time: 0.113739732 200 1 thread Time: 0.070599932 200 2 threads for a speedup of 36%. I'm still only getting about 10% speedup for 800. Is it possible my cache is smaller? -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter @ 2006-10-22 13:50 ` Alinabi 2006-10-22 15:41 ` Jeffrey Creem 2006-10-22 15:57 ` Jeffrey Creem 2 siblings, 1 reply; 68+ messages in thread From: Alinabi @ 2006-10-22 13:50 UTC (permalink / raw) I ran your test programs compiled with gcc 4.0.3 and the following optimizations: COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 -fdump-tree-optimized and I cannot reproduce the large differences in performance everyone else talks about. Here are the times I get: N Ada Fortran ==================== 64 0.002029 0.000000 128 0.016321 0.016000 256 0.214143 0.204013 512 3.125888 3.124195 800 6.374982 5.864366 1024 34.10479 35.22620 2048 277.3071 283.2417 Jeffrey Creem wrote: > Jeffrey Creem wrote: > > Pascal Obry wrote: > > > >> Jeffrey Creem a écrit : > >> > >> > >>> I'd certainly be willing to run a few benchmarks but the important thing > >>> here is that rather innocent looking code is running 2-4x slower than it > >>> "should". > >> > >> > >> > >> The first thing is to be sure that we are running the same program. > >> > >> Running your program with the following changes (as done in Fortran): > >> > >> 1. Using Sum tmp var for the computation > >> > >> for I in A'range (1) loop > >> for J in A'range (2) loop > >> Sum := 0.0; > >> for R in A'range (2) loop > >> Sum := Sum + A (I, R) * B (R, J); > >> end loop; > >> C (I, J) := Sum; > >> end loop; > >> end loop; > >> > >> 2. Using Long_Float instead of Float (I think Fortran float is a > >> Long_Float, to be checked). > >> > >> I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). > >> > >> Pascal. > >> > > > > Actually, the original poster's Ada program had the temp var and all of > > my comparisons of programs that I have asserted were "the same" used the > > temporary. > > > > As for Long_Float v.s. 
Short_Float, gfortran is using 32 bit floats (as > > verified by dumping the tree representation and assembly language). > > > > Since we are all getting a bit confused by specific versions and numbers > > I thought I'd post a summary and create a way of tracking more global > > performance issues at the gnuada.sf.net wiki. > > > > The direct link to these test results are here: > > > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > > > > > > > > > Actually, as a result of this, I submitted a bug report to the GCC > bugzilla list. You can follow progress on it here: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 > > Interesting initial feedback is that > 1) Not an Ada bug. > 2) Is a FORTRAN bug > 3) Is a backend limitation of the optimizer. > > Of course, the FORTRAN one still runs correctly so I don't think most > users will care that it is because of a bug :) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 13:50 ` Alinabi @ 2006-10-22 15:41 ` Jeffrey Creem 2006-10-23 0:02 ` Alinabi 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 15:41 UTC (permalink / raw) Alinabi wrote: > I ran your test programs compiled with gcc 4.0.3 and the following > optimizations: > COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse > -fomit-frame-pointer -O2 -fdump-tree-optimized > and I cannot reproduce the large differences in performance everyone > else talks about. Here are the times I get: > > N Ada Fortran > ==================== > 64 0.002029 0.000000 > 128 0.016321 0.016000 > 256 0.214143 0.204013 > 512 3.125888 3.124195 > 800 6.374982 5.864366 > 1024 34.10479 35.22620 > 2048 277.3071 283.2417 > That is interesting. The question then becomes: has FORTRAN improved on the way to 4.2.0, or has Ada regressed? Try doing a make dis_all which should produce annotated assembly output. The Ada version can be a little daunting in the way we have set up the files, since the generic instantiations at the top fill the .S files (woops, looked like I named them .dis) with the generic instance. Even if you are not an assembly guru, if you start from the bottom of the files you can usually pretty quickly find that inner loop and compare the number of statements. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 15:41 ` Jeffrey Creem @ 2006-10-23 0:02 ` Alinabi 2006-10-23 5:28 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Alinabi @ 2006-10-23 0:02 UTC (permalink / raw) Ok, it appears that the score is Fortran: 13 instructions -- Ada: 17 instructions in the inner loop. Aggain, this is compiled with gcc 4.0.3 with the following switches: -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 -fdump-tree-optimized Fortran inner loop (13 instructions): ============= 3d4: 48 63 c1 movslq %ecx,%rax 3d7: ff c1 inc %ecx 3d9: 48 89 c2 mov %rax,%rdx 3dc: 49 0f af c2 imul %r10,%rax 3e0: 48 0f af d3 imul %rbx,%rdx 3e4: 4c 01 c8 add %r9,%rax 3e7: 48 01 f0 add %rsi,%rax 3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx 3ee: 44 39 c1 cmp %r8d,%ecx 3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0 3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0 3fd: f3 0f 58 c8 addss %xmm0,%xmm1 401: 75 d1 jne 3d4 <MAIN__+0x3d4> Ada inner loop (17 instructions): ============== a90: ff c6 inc %esi a92: 48 63 d6 movslq %esi,%rdx a95: 48 89 d0 mov %rdx,%rax a98: 48 29 e8 sub %rbp,%rax a9b: 4c 01 d8 add %r11,%rax a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1 aa4: 4c 89 c0 mov %r8,%rax aa7: 4c 29 d2 sub %r10,%rdx aaa: 45 31 c9 xor %r9d,%r9d aad: 48 83 c0 04 add $0x4,%rax ab1: 49 0f 48 c1 cmovs %r9,%rax ab5: 48 0f af d0 imul %rax,%rdx ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0 abe: f3 0f 59 c1 mulss %xmm1,%xmm0 ac2: f3 0f 58 d0 addss %xmm0,%xmm2 ac6: 39 de cmp %ebx,%esi ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0> Now, since Jeffrey was right about me not being an assembly guru, here is the assembly code for all three nested loops, so that you can doublecheck yourself. 
Fortran: ======== do I = 1,N 2f2: 8b 84 24 ac 01 00 00 mov 0x1ac(%rsp),%eax 2f9: f3 0f 11 44 24 28 movss %xmm0,0x28(%rsp) 2ff: 85 c0 test %eax,%eax 301: 0f 8e 28 01 00 00 jle 42f <MAIN__+0x42f> do J = 1,N sum = 0.0 do R = 1,N sum = sum + A(I,R)*B(R,J) 307: 48 8b 8c 24 28 01 00 mov 0x128(%rsp),%rcx 30e: 00 30f: 48 8b 94 24 18 01 00 mov 0x118(%rsp),%rdx 316: 00 317: 44 8d 40 01 lea 0x1(%rax),%r8d 31b: 4c 8b a4 24 10 01 00 mov 0x110(%rsp),%r12 322: 00 323: 48 8b 9c 24 40 01 00 mov 0x140(%rsp),%rbx 32a: 00 32b: 4c 8b 9c 24 60 01 00 mov 0x160(%rsp),%r11 332: 00 333: 4c 8b 94 24 78 01 00 mov 0x178(%rsp),%r10 33a: 00 33b: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 340: 48 89 54 24 18 mov %rdx,0x18(%rsp) 345: 48 8b 8c 24 90 01 00 mov 0x190(%rsp),%rcx 34c: 00 end do C(I,J) = sum 34d: 48 8b 94 24 c0 00 00 mov 0xc0(%rsp),%rdx 354: 00 355: 4c 8b 8c 24 68 01 00 mov 0x168(%rsp),%r9 35c: 00 35d: 4c 8b bc 24 f0 00 00 mov 0xf0(%rsp),%r15 364: 00 365: 0f 57 d2 xorps %xmm2,%xmm2 368: c7 44 24 30 01 00 00 movl $0x1,0x30(%rsp) 36f: 00 370: 48 89 4c 24 20 mov %rcx,0x20(%rsp) 375: 48 89 54 24 48 mov %rdx,0x48(%rsp) 37a: 48 8b 8c 24 d8 00 00 mov 0xd8(%rsp),%rcx 381: 00 382: 48 8b 94 24 c8 00 00 mov 0xc8(%rsp),%rdx 389: 00 38a: 48 89 4c 24 40 mov %rcx,0x40(%rsp) 38f: 48 89 54 24 38 mov %rdx,0x38(%rsp) 394: 48 63 44 24 30 movslq 0x30(%rsp),%rax 399: 48 8b 54 24 10 mov 0x10(%rsp),%rdx 39e: 41 bd 01 00 00 00 mov $0x1,%r13d 3a4: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 3a9: 48 0f af d0 imul %rax,%rdx 3ad: 48 0f af 44 24 40 imul 0x40(%rsp),%rax 3b3: 48 8d 3c 0a lea (%rdx,%rcx,1),%rdi 3b7: 48 8b 54 24 38 mov 0x38(%rsp),%rdx 3bc: 4c 8d 34 10 lea (%rax,%rdx,1),%r14 3c0: 48 8b 74 24 20 mov 0x20(%rsp),%rsi 3c5: 49 63 ed movslq %r13d,%rbp 3c8: b9 01 00 00 00 mov $0x1,%ecx 3cd: 0f 28 ca movaps %xmm2,%xmm1 3d0: 48 0f af f5 imul %rbp,%rsi 3d4: 48 63 c1 movslq %ecx,%rax 3d7: ff c1 inc %ecx 3d9: 48 89 c2 mov %rax,%rdx 3dc: 49 0f af c2 imul %r10,%rax 3e0: 48 0f af d3 imul %rbx,%rdx 3e4: 4c 01 c8 add %r9,%rax 3e7: 48 01 
f0 add %rsi,%rax 3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx 3ee: 44 39 c1 cmp %r8d,%ecx 3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0 3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0 3fd: f3 0f 58 c8 addss %xmm0,%xmm1 401: 75 d1 jne 3d4 <MAIN__+0x3d4> 403: 4c 89 f8 mov %r15,%rax 406: 48 8b 54 24 48 mov 0x48(%rsp),%rdx 40b: 41 ff c5 inc %r13d 40e: 48 0f af c5 imul %rbp,%rax 412: 41 39 cd cmp %ecx,%r13d 415: 49 8d 04 06 lea (%r14,%rax,1),%rax 419: f3 0f 11 0c 82 movss %xmm1,(%rdx,%rax,4) 41e: 75 a0 jne 3c0 <MAIN__+0x3c0> 420: ff 44 24 30 incl 0x30(%rsp) 424: 44 39 6c 24 30 cmp %r13d,0x30(%rsp) 429: 0f 85 65 ff ff ff jne 394 <MAIN__+0x394> end do end do Ada: ============= for I in A'range(1) loop 9a5: 41 8b 45 00 mov 0x0(%r13),%eax 9a9: 41 8b 55 04 mov 0x4(%r13),%edx 9ad: 39 d0 cmp %edx,%eax 9af: 0f 8f 8a 01 00 00 jg b3f <_ada_tst_array+0x44f> for J in A'range(2) loop Sum := 0.0; for R in A'range(2) loop Sum := Sum + A(I,R)*B(R,J); 9b5: 4c 8b bc 24 d0 00 00 mov 0xd0(%rsp),%r15 9bc: 00 end loop; C(I,J) := Sum; 9bd: 48 8b 8c 24 c0 00 00 mov 0xc0(%rsp),%rcx 9c4: 00 9c5: 4c 8b a4 24 e0 00 00 mov 0xe0(%rsp),%r12 9cc: 00 9cd: 89 44 24 68 mov %eax,0x68(%rsp) 9d1: 4c 89 7c 24 50 mov %r15,0x50(%rsp) 9d6: 48 89 8c 24 88 00 00 mov %rcx,0x88(%rsp) 9dd: 00 9de: 45 8b 75 08 mov 0x8(%r13),%r14d 9e2: 45 8b 6d 0c mov 0xc(%r13),%r13d 9e6: 44 89 6c 24 6c mov %r13d,0x6c(%rsp) 9eb: 4c 63 7c 24 6c movslq 0x6c(%rsp),%r15 9f0: 48 63 f8 movslq %eax,%rdi 9f3: 0f 57 e4 xorps %xmm4,%xmm4 9f6: 48 89 7c 24 20 mov %rdi,0x20(%rsp) 9fb: 49 63 ee movslq %r14d,%rbp 9fe: 89 54 24 2c mov %edx,0x2c(%rsp) a02: 44 89 eb mov %r13d,%ebx a05: 4c 89 7c 24 30 mov %r15,0x30(%rsp) a0a: 44 3b 74 24 6c cmp 0x6c(%rsp),%r14d a0f: 0f 8f 17 01 00 00 jg b2c <_ada_tst_array+0x43c> a15: 48 8b 44 24 30 mov 0x30(%rsp),%rax a1a: ba 00 00 00 00 mov $0x0,%edx a1f: 45 89 f5 mov %r14d,%r13d a22: 0f 28 dc movaps %xmm4,%xmm3 a25: 48 8b 4c 24 40 mov 0x40(%rsp),%rcx a2a: 48 29 e8 sub %rbp,%rax a2d: 48 8d 04 85 04 00 00 lea 
0x4(,%rax,4),%rax a34: 00 a35: 48 85 c0 test %rax,%rax a38: 48 0f 48 c2 cmovs %rdx,%rax a3c: 48 63 54 24 68 movslq 0x68(%rsp),%rdx a41: 48 c1 f8 02 sar $0x2,%rax a45: 48 89 54 24 70 mov %rdx,0x70(%rsp) a4a: 48 2b 54 24 20 sub 0x20(%rsp),%rdx a4f: 49 89 d3 mov %rdx,%r11 a52: 4c 0f af d8 imul %rax,%r11 a56: 8b 01 mov (%rcx),%eax a58: 4c 63 d0 movslq %eax,%r10 a5b: 8b 41 0c mov 0xc(%rcx),%eax a5e: 48 98 cltq a60: 8b 51 08 mov 0x8(%rcx),%edx a63: 48 63 d2 movslq %edx,%rdx a66: 48 29 d0 sub %rdx,%rax a69: 48 89 14 24 mov %rdx,(%rsp) a6d: 4c 8d 04 85 00 00 00 lea 0x0(,%rax,4),%r8 a74: 00 a75: 49 63 cd movslq %r13d,%rcx a78: 4c 8b 7c 24 50 mov 0x50(%rsp),%r15 a7d: 44 89 f6 mov %r14d,%esi a80: 48 89 c8 mov %rcx,%rax a83: 48 2b 04 24 sub (%rsp),%rax a87: 0f 28 d3 movaps %xmm3,%xmm2 a8a: 49 8d 3c 87 lea (%r15,%rax,4),%rdi a8e: eb 02 jmp a92 <_ada_tst_array+0x3a2> a90: ff c6 inc %esi a92: 48 63 d6 movslq %esi,%rdx a95: 48 89 d0 mov %rdx,%rax a98: 48 29 e8 sub %rbp,%rax a9b: 4c 01 d8 add %r11,%rax a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1 aa4: 4c 89 c0 mov %r8,%rax aa7: 4c 29 d2 sub %r10,%rdx aaa: 45 31 c9 xor %r9d,%r9d aad: 48 83 c0 04 add $0x4,%rax ab1: 49 0f 48 c1 cmovs %r9,%rax ab5: 48 0f af d0 imul %rax,%rdx ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0 abe: f3 0f 59 c1 mulss %xmm1,%xmm0 ac2: f3 0f 58 d0 addss %xmm0,%xmm2 ac6: 39 de cmp %ebx,%esi ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0> aca: 48 8b 54 24 48 mov 0x48(%rsp),%rdx acf: 8b 02 mov (%rdx),%eax ad1: 48 63 d0 movslq %eax,%rdx ad4: 48 8b 7c 24 48 mov 0x48(%rsp),%rdi ad9: 8b 47 0c mov 0xc(%rdi),%eax adc: 48 63 f8 movslq %eax,%rdi adf: 4c 8b 7c 24 48 mov 0x48(%rsp),%r15 ae4: 41 8b 47 08 mov 0x8(%r15),%eax ae8: 48 98 cltq aea: 4c 8b 7c 24 70 mov 0x70(%rsp),%r15 aef: 48 29 c7 sub %rax,%rdi af2: 48 29 c1 sub %rax,%rcx af5: 48 8d 04 bd 04 00 00 lea 0x4(,%rdi,4),%rax afc: 00 afd: 49 29 d7 sub %rdx,%r15 b00: 48 85 c0 test %rax,%rax b03: 4c 89 fa mov %r15,%rdx b06: 49 0f 48 c1 cmovs %r9,%rax b0a: 48 0f af d0 imul 
%rax,%rdx b0e: 48 8b 84 24 88 00 00 mov 0x88(%rsp),%rax b15: 00 b16: 48 8d 0c 88 lea (%rax,%rcx,4),%rcx b1a: f3 0f 11 14 11 movss %xmm2,(%rcx,%rdx,1) b1f: 41 39 f5 cmp %esi,%r13d b22: 74 08 je b2c <_ada_tst_array+0x43c> b24: 41 ff c5 inc %r13d b27: e9 49 ff ff ff jmpq a75 <_ada_tst_array+0x385> b2c: 8b 54 24 2c mov 0x2c(%rsp),%edx b30: 39 54 24 68 cmp %edx,0x68(%rsp) b34: 74 09 je b3f <_ada_tst_array+0x44f> b36: ff 44 24 68 incl 0x68(%rsp) b3a: e9 cb fe ff ff jmpq a0a <_ada_tst_array+0x31a> end loop; end loop; Jeffrey Creem wrote: > Alinabi wrote: > > I ran your test programs compiled with gcc 4.0.3 and the following > > optimizations: > > COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse > > -fomit-frame-pointer -O2 -fdump-tree-optimized > > and I cannot reproduce the large differences in performance everyone > > else talks about. Here are the times I get: > > > > N Ada Fortran > > ==================== > > 64 0.002029 0.000000 > > 128 0.016321 0.016000 > > 256 0.214143 0.204013 > > 512 3.125888 3.124195 > > 800 6.374982 5.864366 > > 1024 34.10479 35.22620 > > 2048 277.3071 283.2417 > > > > That is interesting. The question then becomes has FORTRAN improved on > the way to 4.2.0 or has Ada regressed. > > Try doing a make dis_all which should produce annotated assembly output. > The Ada version can be a little daunting in the way we have setup the > files since the generic instantiations at the top full the .S files > (woops, looked like I named them .dis) with the generic instatance. > > Even if you are not an assembly guru, if you start from the bottom of > the files you can usually pretty quickly find that inner loop and > compare the number of statements. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 0:02 ` Alinabi @ 2006-10-23 5:28 ` Gautier 2006-10-23 16:32 ` Alinabi 0 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-23 5:28 UTC (permalink / raw) Alinabi: > -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 > -fdump-tree-optimized Did you give -funroll-loops a try ? G. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 5:28 ` Gautier @ 2006-10-23 16:32 ` Alinabi 0 siblings, 0 replies; 68+ messages in thread From: Alinabi @ 2006-10-23 16:32 UTC (permalink / raw) I did, and it made no difference. I think no loop unrolling is done in the part of the code which is being timed, as the number of iterations is not known at compile time. Gautier wrote: > Alinabi: > > > -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 > > -fdump-tree-optimized > > Did you give -funroll-loops a try ? > > G. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 13:50 ` Alinabi @ 2006-10-22 15:57 ` Jeffrey Creem 2006-10-22 19:32 ` Damien Carbonne 2 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 15:57 UTC (permalink / raw) Followup on the bug report. One of the comments asserted that the two programs were not equivalent, though I am not yet 100% convinced that I believe it. His recommendation was to remove a level of indirection by changing the way the array is declared. N : Positive := Positive'Value (Argument (1)); G : Ada.Numerics.Float_Random.Generator; type Real_Matrix is array (1..N, 1..N) of Float; type Matrix_Access is access Real_Matrix; A,B,C : Matrix_Access; Start, Finish : Ada.Calendar.Time; Sum : Float := 0.0; begin A := new Real_Matrix; B := new Real_Matrix; C := new Real_Matrix; This does indeed substantially improve the performance (still not quite to FORTRAN levels). The reason I question the equivalence is that in the original version, the array could have been passed to a procedure that took in an unconstrained array (which is pretty much what I think the FORTRAN version would allow) while this new version would not do that. A quick check shows the FORTRAN is now only 1.2 times faster (did not do the multiple run thing yet). Perhaps tonight. I'll also attempt to take one of the threaded ones and include that as well. Can someone that understands FORTRAN better make an argument about the "closeness" of this approach vs. the other? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 15:57 ` Jeffrey Creem @ 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier ` (3 more replies) 0 siblings, 4 replies; 68+ messages in thread From: Damien Carbonne @ 2006-10-22 19:32 UTC (permalink / raw) Jeffrey Creem a écrit : > Followup on the bug report. > > One of the comments asserted that the two programs were not equivalent > though I am not yet 100% convinced that I believe it yet. > <snip> > > > Can someone that understands FORTRAN better make an argument about the > "closeness" of this approach vs. the other? I've been following this thread for some time and I'm not a Fortran Guru; however, IIRC, Fortran arrays and Ada arrays don't have the same memory layout (something like row-major order versus column-major order). I've not compared the programs of this thread with care, but I have the feeling that this could change things. Some years ago, I had to mix Ada and Fortran, and I remember that we had to invert the loop order to obtain the same result. One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada side to obtain the same result, or to exchange the loops. My knowledge of Fortran was limited to Fortran 77, and I don't know if this is still valid with Fortran 95 or later. But if it is still true, I would not consider the 2 programs as equivalent. Regards Damien Carbonne ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne @ 2006-10-22 20:00 ` Gautier 2006-10-22 20:51 ` Damien Carbonne 2006-10-23 1:31 ` Jeffrey Creem ` (2 subsequent siblings) 3 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-22 20:00 UTC (permalink / raw) Damien Carbonne: > I've been following this thread for some times and I'm not a Fortran > Guru,however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (Something like row major order and column major order). > I've not compared programs of this thread with care, but I've the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loops order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if is still true, I would not consider the 2 programs as equivalent. The question of Fortran's array storage has been treated earlier in this discussion, you just happened to add more noise by not reading enough posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it shows the most serious testing so far, with the exact sources used for testing. Cheers, Gautier ______________________________________________________________ Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm NB: For a direct answer, e-mail address on the Web site! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:00 ` Gautier @ 2006-10-22 20:51 ` Damien Carbonne 2006-10-23 2:15 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Damien Carbonne @ 2006-10-22 20:51 UTC (permalink / raw) Gautier a écrit : > Damien Carbonne: <snip> > > > The question of Fortran's array storage has been treated earlier in this > discussion, you just happened to add more noise by not reading enough > posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it > shows the most serious testing so far, with the exact sources used for > testing. Sorry for this. I'll go back to my silent lazy listening after this message :-) S. Tardieu had also suggested this layout issue in one message and I didn't follow the links given by M. Krischik. Anyway, independently of compiler bugs or limitations, differences in matrix layout lead to different (not strictly equivalent) programs! Which was precisely a question asked by J. Creem in his first sentence! Regards Damien ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:51 ` Damien Carbonne @ 2006-10-23 2:15 ` Jeffrey Creem 2006-10-23 2:29 ` Jeffrey R. Carter 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 2:15 UTC (permalink / raw) Damien Carbonne wrote: > Gautier a écrit : > >> Damien Carbonne: > > <snip> > >> >> >> The question of Fortran's array storage has been treated earlier in >> this discussion, you just happened to add more noise by not reading >> enough posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 >> CET... it shows the most serious testing so far, with the exact >> sources used for testing. > > Sorry for this. I'll go back to my silent lazy listening after this > message :-) S. Tardieu had also suggested this layout issue in one > message and I didn't follow the links given by M. Krischik. Anyway, > independently of compiler bugs or limitations, differences in matrix > layout lead to different (not strictly equivalent) programs! Which was > precisely a question asked by J. Creem in his first sentence! > > Regards > Damien Actually, feel free to chime in even if you are not sure you are adding value. Heck, if we all waited until we were adding value, we'd never post anything :) Part of the problem here was my poor wording of the question. What I was asking is whether the two programs were semantically equivalent without dropping down to specific issues of storage representation. Specifically, one of the biggest speedups posted (and suggested by one person on the bugzilla list) changes the Ada program so it is no longer a pointer to an unconstrained array but rather a pointer to a constrained array where the constrained array type definition is determined at run time based on the size of the command line parameter. This may seem like the same thing but where I am asserting this falls apart is if someone wanted to write code that operated on variable sized arrays, this approach would probably not work (i.e.
dynamic creation of a fixed length array). It appears to me that the FORTRAN version is using as dynamic an array size as possible and further that it is possible to pass these dynamically sized arrays to other procedures (for example matmul). Now, matmul may be some sort of special case because it appears to be a compiler intrinsic on most compilers. I've poked around the web resources a bit for FORTRAN 90/95 tutorials but I have not found the specific examples that would convince me I know what I am talking about. I guess what I am asking (or perhaps asserting) is that the original version of the Ada program is the closest to the FORTRAN version and not the one that creates the fixed length array, since it appears that allocatable arrays of varying sizes can be passed to Fortran subroutines. The original version of the Ada maintained that property. The modified version (and its derivatives that set the size in the type rather than in the new) do not appear to me to be equivalent programs. The reason this is important is that in general I don't think we are trying to solve the "write the fastest program that can calculate a matrix multiply" but rather investigate the way in which fairly simple programming idioms are translated into assembly language and understand their runtime implications (at least that is what my goal has been in this). ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 2:15 ` Jeffrey Creem @ 2006-10-23 2:29 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 2:29 UTC (permalink / raw) Jeffrey Creem wrote: > > Specifically, one of the biggest speedups posted (and suggested by one > person on the bugzilla list) changes the Ada program so it is no longer > a pointer to an unconstrained array but rather a pointer to a > constrained array where the constrained array type definition is > determined at run time based on the size of the command line parameter. > > This may seem like the same thing but where I am asserting this falls > apart is if someone wanted to write code that operated on variable sized > arrays, this approach would probably not work (i.e. dynamic creation of > a fixed length array). You should be able to get the best of both worlds: N : constant Positive := Integer'Value (Argument (1) ); type Base_Matrix is array (Positive range <>, Positive range <>) of Float; subtype Real_Matrix is Base_Matrix (1 .. N, 1 .. N); -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier @ 2006-10-23 1:31 ` Jeffrey Creem 2006-10-23 3:10 ` Jeffrey Creem 2006-10-23 12:33 ` Warner BRUNS 2006-10-23 12:40 ` Warner BRUNS 3 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 1:31 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : > >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent >> though I am not yet 100% convinced that I believe it yet. >> > <snip> > >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach vs. the other? > > > I've been following this thread for some time and I'm not a Fortran > Guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert the loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange the loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > Yes, there is always the chance of row major/column major issues creeping in and causing dramatically different cache results; however, the original poster seemed to have tried the swap to no effect, and the assembly code appears to show that this is/was not the problem. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 1:31 ` Jeffrey Creem @ 2006-10-23 3:10 ` Jeffrey Creem 2006-10-23 7:31 ` Jeffrey R. Carter 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 3:10 UTC (permalink / raw) Ok, a lot of suggestions have been posted, including a few that I have not addressed or even investigated (specifically, the post that talked about the assembly language output from gcc 4.0.3, which makes this appear (to me) to be a regression). Having said that, I did update the wiki site with some timing numbers people asked for from a dual processor machine. http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison Scroll to the bottom and start reading at "Other Versions" if you are interested. This is getting too messy for plain text conversation. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 3:10 ` Jeffrey Creem @ 2006-10-23 7:31 ` Jeffrey R. Carter 2006-10-23 11:55 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 7:31 UTC (permalink / raw) Jeffrey Creem wrote: > > Having said that, I did update the wiki site with some timing numbers > people asked for from a dual processor machine. > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > Scroll to the bottom and start reading at "Other Versions" if you are > interested. This is getting too messy for plain text conversation. Wow, a greater than 50% speedup from using 2 tasks instead of 1 (from ada/tst_array to ada/tst_array_task_2). Something is odd there. This shows that, for those with dual-processor machines, easy-to-create Ada is faster than easy-to-create FORTRAN. This doesn't help those with single processors, of course. I doubt if anything will beat matmul (short of 640_000 processors). But Ada with explicit loops and FORTRAN with matmul are hardly equivalent. You need to convention-FORTRAN the Ada array type, import matmul, and call it to get a fair comparison. There shouldn't be much difference. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 7:31 ` Jeffrey R. Carter @ 2006-10-23 11:55 ` Jeffrey Creem 2006-10-23 19:52 ` Wiljan Derks 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 11:55 UTC (permalink / raw) Jeffrey R. Carter wrote: > Jeffrey Creem wrote: > >> >> Having said that, I did update the wiki site with some timing numbers >> people asked for from a dual processor machine. >> >> http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison >> >> >> Scroll to the bottom and start reading at "Other Versions" if you are >> interested. This is getting too messy for plain text conversation. > > > Wow, a greater than 50% speedup from using 2 tasks instead of 1 (from > ada/tst_array to ada/tst_array_task_2). Something is odd there. > > This shows that, for those with dual-processor machines, easy-to-create > Ada is faster than easy-to-create FORTRAN. > > This doesn't help those with single processors, of course. > > I doubt if anything will beat matmul (short of 640_000 processors). But > Ada with explicit loops and FORTRAN with matmul are hardly equivalent. > You need to convention-FORTRAN the Ada array type, import matmul, and > call it to get a fair comparison. There shouldn't be much difference. > I looked at the gcc FORTRAN matmul. Unless there is some additional trickery going on behind the scenes, it is not anything magical. It looks like a matmul implemented in C with manual array subscripting logic (i.e. it uses a single-dimensional array overlay). In any case, it is not so much matmul I am trying to make faster here but rather just the nature of 2D array traversals in native language structures. I just included the FORTRAN matmul to be "more than fair" to FORTRAN, as I am in no way trying to bash FORTRAN. The speedup for the tasks is quite odd though. I'll need to disassemble it tonight.
I also just finished a 4.0.2 install last night so I'll get those numbers to see if all of this mess is simply a regression someplace in the compiler. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 11:55 ` Jeffrey Creem @ 2006-10-23 19:52 ` Wiljan Derks 2006-10-23 20:25 ` Jeffrey R. Carter 2006-10-24 9:52 ` Dr. Adrian Wrigley 0 siblings, 2 replies; 68+ messages in thread From: Wiljan Derks @ 2006-10-23 19:52 UTC (permalink / raw) > Jeffrey R. Carter wrote: > I looked at the gcc FORTRAN matmul. Unless there some additional trickery > going on behind the scenes, it is not anything magical. It looks like a > matmul implemented in C with manual array subscripting logic (i.e. uses a > single dimensional array overlay).. When going for a faster algorithm this might be useful information. Notice that this matmul uses an O(n^3) algorithm and it is possible to do this faster. I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. It is easy to see from the loops that the basic multiplications are done n times for the same pairs of numbers. In a similar way, the same additions are done multiple times. That means that there is real potential to improve things. For all small matrices (say 2 up to 4) one can write out all the logic to multiply them and then optimize the code for those specific multiplications. This will give us fast multiplications for small matrices. When having a larger matrix, it can be chopped into 4 smaller ones and these pieces can then be multiplied and added together. In total this can lead to a faster algorithm, although much more complex than the plain multiplication. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 19:52 ` Wiljan Derks @ 2006-10-23 20:25 ` Jeffrey R. Carter 2006-10-24 9:52 ` Dr. Adrian Wrigley 1 sibling, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 20:25 UTC (permalink / raw) Wiljan Derks wrote: >> Jeffrey R. Carter wrote: >> I looked at the gcc FORTRAN matmul. Unless there some additional trickery >> going on behind the scenes, it is not anything magical. It looks like a >> matmul implemented in C with manual array subscripting logic (i.e. uses a >> single dimensional array overlay).. For the record, this was Jeffrey Creem, not me. -- Jeff Carter "Blessed are they who convert their neighbors' oxen, for they shall inhibit their girth." Monty Python's Life of Brian 83 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 19:52 ` Wiljan Derks 2006-10-23 20:25 ` Jeffrey R. Carter @ 2006-10-24 9:52 ` Dr. Adrian Wrigley 2006-10-24 11:50 ` Jeffrey Creem 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 2 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-24 9:52 UTC (permalink / raw) On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote: >> Jeffrey R. Carter wrote: >> I looked at the gcc FORTRAN matmul. Unless there some additional trickery >> going on behind the scenes, it is not anything magical. It looks like a >> matmul implemented in C with manual array subscripting logic (i.e. uses a >> single dimensional array overlay).. > When going for a faster algorithm this might be usefull information. > Notice that this matmul uses an O(n^3) algorithm and it is possible to do > this faster. > I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. > It is easy to see that from the loops that the basic multiplications are > done n times for the same pairs of numbers. I know you're quite correct that there is a faster algorithm for large n. But looking at the loops, I don't see where the same number pairs are multiplied at different points in the calculation. Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > In a similar way, the same additions are done multiple times. > That means that there is real potential to improve things. Aren't they only the same additions if the same products are summed? I may be being particularly stupid today, but if the redundancy is obvious from the loops, can you help me see where! Thanks. -- Adrian ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 9:52 ` Dr. Adrian Wrigley @ 2006-10-24 11:50 ` Jeffrey Creem 2006-10-24 16:24 ` Jeffrey R. Carter 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-24 11:50 UTC (permalink / raw) Dr. Adrian Wrigley wrote: > On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote: > > >>>Jeffrey R. Carter wrote: >>>I looked at the gcc FORTRAN matmul. Unless there some additional trickery >>>going on behind the scenes, it is not anything magical. It looks like a >>>matmul implemented in C with manual array subscripting logic (i.e. uses a >>>single dimensional array overlay).. >> >>When going for a faster algorithm this might be usefull information. >>Notice that this matmul uses an O(n^3) algorithm and it is possible to do >>this faster. >>I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. >>It is easy to see that from the loops that the basic multiplications are >>done n times for the same pairs of numbers. > > > I know you're quite correct that there is a faster algorithm for > large n. But looking at the loops, I don't see where the same > number pairs are multiplied at different points in the calculation. > Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > > >>In a similar way, the same additions are done multiple times. >>That means that there is real potential to improve things. > > > Aren't they only the same additions if the same products are summed? > > I may be being particularly stupid today, but if the redundancy > is obvious from the loops, can you help me see where! Thanks. > -- > Adrian > I have not looked at algorithms in this area in a few years but I do recall that you need very very large N before any of them start being worthwhile. There are some speedups possibly that stay N**3 but optimize fetches and stores for cache performance and do hand vectorization of portions of the inner loop that can achieve speedup. 
I would guess that even the library level matmul (which still has the best overall time) could be made about 2x faster on a recent intel platform with those changes. Back to the original discussion, there are quite a few interesting things going on. 1) I have confirmed what another poster wrote that 4.2.0 is substantially slower than 4.0.3 on the original Ada example (thus, this performance is a regression) 2) Small changes to the original code that still meet the original structure and intent of the original code can move the run time up and down by at least 50% 3) The two task based version of the code is running more than 2x faster than the single task version on a 2 processor machine. Some of this is from the two tasks but looking at the assembly, another portion of it is related to #2 above in that the re-arrangement of the math allows the compiler to get less brain dead. If I get a chance tonight, I'll post some update results and analysis of the generated code. ^ permalink raw reply [flat|nested] 68+ messages in thread
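One cache-oriented rearrangement that stays O(n^3) is loop tiling with an i-k-j inner order; a sketch (assuming square, 1-based matrices in the default row-major layout, with the tile size a guess to be tuned per machine):

```ada
procedure Blocked_Matmul_Demo is
   type Real_Matrix is
     array (Positive range <>, Positive range <>) of Float;

   Blk : constant := 64;  --  tile edge; tune to the cache size

   --  i-k-j order keeps the innermost accesses to B and C unit-stride
   --  for a row-major layout; tiling I and K keeps the touched rows
   --  of B resident in cache across iterations.  Assumes A, B and C
   --  are square with bounds 1 .. N.
   procedure Matmul_Blocked (A, B : in Real_Matrix; C : out Real_Matrix) is
      N : constant Natural := A'Length (1);
   begin
      C := (others => (others => 0.0));
      for I0 in 0 .. (N - 1) / Blk loop
         for K0 in 0 .. (N - 1) / Blk loop
            for I in I0 * Blk + 1 .. Natural'Min ((I0 + 1) * Blk, N) loop
               for K in K0 * Blk + 1 .. Natural'Min ((K0 + 1) * Blk, N) loop
                  declare
                     AIK : constant Float := A (I, K);
                  begin
                     for J in 1 .. N loop
                        C (I, J) := C (I, J) + AIK * B (K, J);
                     end loop;
                  end;
               end loop;
            end loop;
         end loop;
      end loop;
   end Matmul_Blocked;

   A, B, C : Real_Matrix (1 .. 100, 1 .. 100);
begin
   A := (others => (others => 1.0));
   B := (others => (others => 1.0));
   Matmul_Blocked (A, B, C);
end Blocked_Matmul_Demo;
```

The procedure and demo names are made up; the point is only the loop structure, which is what the hand-optimized library kernels mentioned above do more aggressively (plus vectorization).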
* Re: GNAT compiler switches and optimization 2006-10-24 11:50 ` Jeffrey Creem @ 2006-10-24 16:24 ` Jeffrey R. Carter 2006-10-25 3:50 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-24 16:24 UTC (permalink / raw) Jeffrey Creem wrote: > > 2) Small changes to the original code that still meet the original > structure and intent of the original code can move the run time up and > down by at least 50% > 3) The two task based version of the code is running more than 2x faster > than the single task version on a 2 processor machine. Some of this is > from the two tasks but looking at the assembly, another portion of it is > related to #2 above in that the re-arrangement of the math allows the > compiler to get less brain dead. These seem quite odd to me. Perhaps whatever is causing this is also the cause of the speedup I saw in the sequential case when the multiplication is put in a procedure. -- Jeff Carter "You've got the brain of a four-year-old boy, and I bet he was glad to get rid of it." Horse Feathers 47 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 16:24 ` Jeffrey R. Carter @ 2006-10-25 3:50 ` Jeffrey Creem 2006-10-25 15:32 ` claude.simon 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-25 3:50 UTC (permalink / raw) Jeffrey R. Carter wrote: > Jeffrey Creem wrote: > >> >> 2) Small changes to the original code that still meet the original >> structure and intent of the original code can move the run time up and >> down by at least 50% >> 3) The two task based version of the code is running more than 2x >> faster than the single task version on a 2 processor machine. Some of >> this is from the two tasks but looking at the assembly, another >> portion of it is related to #2 above in that the re-arrangement of >> the math allows the compiler to get less brain dead. > > > These seem quite odd to me. Perhaps whatever is causing this is also the > cause of the speedup I saw in the sequential case when the > multiplication is put in a procedure. > Ok, we are all probably tired of taking about this topic but I posted what I think will be the final write-up on this (until the bugzilla issue for it is resolved). http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison And I agree that the > 2x speedup for the two task version is not real. There are now other versions that are coded in a similar maner to the 2 task version but which are single threaded. The 2 task version runs almost exactly 2x faster (sometimes slightly slower) on my dual processor machine. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-25 3:50 ` Jeffrey Creem @ 2006-10-25 15:32 ` claude.simon 0 siblings, 0 replies; 68+ messages in thread From: claude.simon @ 2006-10-25 15:32 UTC (permalink / raw) A way to have better performance is to optimize the use of the memory cache. An Ada-written matmul procedure can approach the Fortran intrinsic matmul speed with -O2 and -gnatp. ada version with array size of 600 : time = 0.6722 fortran version with array size of 600 : time = 0.6094

package Array_Product is
   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;
   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix);
end Array_Product;

package body Array_Product is
   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix) is
      BL  : array (1 .. B'Length (1) * B'Length (2)) of Float;
      D   : Natural := 0;
      Sum : Float   := 0.0;
   begin
      D := 0;
      --  transposition gave the best speedup: cache + only one index
      Transpose_B_in_BL :
      for J in B'Range (1) loop
         for R in B'Range (2) loop
            BL (D + R) := B (R, J);
         end loop;
         D := D + B'Length (2);
      end loop Transpose_B_in_BL;
      for I in A'Range (1) loop
         declare
            --  give the cache its best chance with A: second speedup
            Ai : array (A'Range (2)) of Float;
         begin
            for R in Ai'Range loop
               Ai (R) := A (I, R);
            end loop;
            D := 0;
            for J in A'Range (2) loop
               Sum := 0.0;
               for R in A'Range (2) loop
                  Sum := Sum + Ai (R) * BL (D + R);
               end loop;
               D := D + A'Length (2);
               C (I, J) := Sum;
            end loop;
         end;
      end loop;
   end Matmul;
end Array_Product;

Jeffrey Creem a écrit : > Jeffrey R. Carter wrote: > > Jeffrey Creem wrote: > > > >> > >> 2) Small changes to the original code that still meet the original > >> structure and intent of the original code can move the run time up and > >> down by at least 50% > >> 3) The two task based version of the code is running more than 2x > >> faster than the single task version on a 2 processor machine.
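A hypothetical driver for the Array_Product package above (the matrix size and fill values are made up; note that Matmul as posted assumes square matrices and puts its scratch arrays BL and Ai on the stack, so very large N may need a larger stack limit):

```ada
with Array_Product; use Array_Product;

procedure Tst_Matmul is
   N : constant := 600;
   --  Heap-allocate the operands, as in the original benchmark.
   A : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
   B : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
   C : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
begin
   A.all := (others => (others => 1.0));
   B.all := (others => (others => 2.0));
   Matmul (A.all, B.all, C.all);
end Tst_Matmul;
```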
Some of > >> this is from the two tasks but looking at the assembly, another > >> portion of it is related to #2 above in that the re-arrangement of > >> the math allows the compiler to get less brain dead. > > > > > > These seem quite odd to me. Perhaps whatever is causing this is also the > > cause of the speedup I saw in the sequential case when the > > multiplication is put in a procedure. > > > > Ok, we are all probably tired of taking about this topic but I posted > what I think will be the final write-up on this (until the bugzilla > issue for it is resolved). > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > And I agree that the > 2x speedup for the two task version is not real. > There are now other versions that are coded in a similar maner to the 2 > task version but which are single threaded. The 2 task version runs > almost exactly 2x faster (sometimes slightly slower) on my dual > processor machine. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 9:52 ` Dr. Adrian Wrigley 2006-10-24 11:50 ` Jeffrey Creem @ 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 0 replies; 68+ messages in thread From: Wiljan Derks @ 2006-10-24 19:21 UTC (permalink / raw) "Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk> wrote in message news:pan.2006.10.24.09.54.07.340442@linuxchip.demon.co.uk.uk.uk... > I know you're quite correct that there is a faster algorithm for > large n. But looking at the loops, I don't see where the same > number pairs are multiplied at different points in the calculation. > Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > > Aren't they only the same additions if the same products are summed? > Sorry, I was too fast writing this remark. Since I was wrong, I checked my old school papers and found how to perform a faster matrix multiplication. It is called the Strassen algorithm. For more details see: http://en.wikipedia.org/wiki/Strassen_algorithm It still looks to me like there may be considerable savings, since this algorithm is O(n^2.8) instead of O(n^3). Wikipedia also mentions the drawback: numeric instability! So the subtraction and addition tricks probably lead to less accurate results in certain cases. ^ permalink raw reply [flat|nested] 68+ messages in thread
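For the 2x2 base case, the seven Strassen products look like this (a sketch transcribed from the standard formulation; in a real implementation the scalars would themselves be sub-matrices, and the recursion would fall back to an ordinary O(n^3) kernel below some cutoff, since the bookkeeping only pays off for large n):

```ada
procedure Strassen_2x2
  (A11, A12, A21, A22 : in  Float;
   B11, B12, B21, B22 : in  Float;
   C11, C12, C21, C22 : out Float)
is
   --  Seven multiplications instead of the naive eight; the extra
   --  additions/subtractions are what cause the numeric-instability
   --  concern mentioned above.
   M1 : constant Float := (A11 + A22) * (B11 + B22);
   M2 : constant Float := (A21 + A22) * B11;
   M3 : constant Float := A11 * (B12 - B22);
   M4 : constant Float := A22 * (B21 - B11);
   M5 : constant Float := (A11 + A12) * B22;
   M6 : constant Float := (A21 - A11) * (B11 + B12);
   M7 : constant Float := (A12 - A22) * (B21 + B22);
begin
   C11 := M1 + M4 - M5 + M7;
   C12 := M3 + M5;
   C21 := M2 + M4;
   C22 := M1 - M2 + M3 + M6;
end Strassen_2x2;
```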
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier 2006-10-23 1:31 ` Jeffrey Creem @ 2006-10-23 12:33 ` Warner BRUNS 2006-10-23 12:40 ` Warner BRUNS 3 siblings, 0 replies; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 12:33 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent, >> though I am not yet 100% convinced that I believe it. >> > <snip> >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach v.s. the other? > > I've been following this thread for some time and I'm not a Fortran > guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > A problem with Ada is that the memory layout of a two-dimensional array is not specified in the Ada Standard. If you want to have a uniform memory layout between different Ada compilers, you have to specify that the memory layout should be like Fortran would do it. Very odd. TYPE DefReal IS DIGITS 12; TYPE DefReal2D IS ARRAY(DefInt RANGE <>, DefInt RANGE <>) OF DefReal; PRAGMA CONVENTION (Fortran, DefReal2D); Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne ` (2 preceding siblings ...) 2006-10-23 12:33 ` Warner BRUNS @ 2006-10-23 12:40 ` Warner BRUNS 2006-10-23 13:52 ` Georg Bauhaus 2006-10-23 15:02 ` Robert A Duff 3 siblings, 2 replies; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 12:40 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent, >> though I am not yet 100% convinced that I believe it. >> > <snip> >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach v.s. the other? > > I've been following this thread for some time and I'm not a Fortran > guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > A problem with Ada is that the memory layout of multidimensional arrays is not specified in the standard. If you want to have the same memory layout for all Ada compilers, you have to specify that a multidimensional array should follow the convention of a foreign language, eg. Fortran. TYPE DefReal IS DIGITS 12; TYPE DefReal2D IS ARRAY(Integer RANGE <>, Integer RANGE <>) OF DefReal; PRAGMA CONVENTION (Fortran, DefReal2D); There is something missing in the Ada standard.
Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 12:40 ` Warner BRUNS @ 2006-10-23 13:52 ` Georg Bauhaus 2006-10-23 17:11 ` Warner BRUNS 2006-10-23 15:02 ` Robert A Duff 1 sibling, 1 reply; 68+ messages in thread From: Georg Bauhaus @ 2006-10-23 13:52 UTC (permalink / raw) On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: > A problem with Ada is that the memory layout of multidimensional arrays > is not specified in the standard. If you want to have the same memory > layout for all Ada compilers, you have to specify that a multimensional > array should follow the convention of a foreign language, eg. Fortran. Will a memory layout cast in source really help with e.g. all Ada X Fortran combinations of compilers and libraries? It might disallow "knowledgeable" compilers to choose a layout that suits the current setup, I'd think, and require touching the sources. -- Georg ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 13:52 ` Georg Bauhaus @ 2006-10-23 17:11 ` Warner BRUNS 2006-10-23 17:57 ` Dr. Adrian Wrigley 0 siblings, 1 reply; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 17:11 UTC (permalink / raw) Georg Bauhaus wrote: > On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: > > >> A problem with Ada is that the memory layout of multidimensional arrays >> is not specified in the standard. If you want to have the same memory >> layout for all Ada compilers, you have to specify that a multimensional >> array should follow the convention of a foreign language, eg. Fortran. > > Will a memory layout cast in source really help with e.g. all > Ada X Fortran combinations of compilers and libraries? > It might disallow "knowledgeable" compilers to choose a layout > that suits the current setup, I'd think, and require touching > the sources. > > > -- Georg > > It will help with all Ada/ Fortran Compiler combinations, but the really strange thing is that the memory layout is not definitely known from the Ada source alone, if it is not specified as being the layout as required by a different language, eg. Fortran. If one wants to code a cache aware algorithm for eg. multiplying matrices, one cannot do this in Ada without enforcing the memory layout of the matrices via PRAGMA CONVENTION. Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 17:11 ` Warner BRUNS @ 2006-10-23 17:57 ` Dr. Adrian Wrigley 0 siblings, 0 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-23 17:57 UTC (permalink / raw) On Mon, 23 Oct 2006 19:11:51 +0200, Warner BRUNS wrote: > Georg Bauhaus wrote: >> On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: >> >>> A problem with Ada is that the memory layout of multidimensional arrays >>> is not specified in the standard. If you want to have the same memory >>> layout for all Ada compilers, you have to specify that a multidimensional >>> array should follow the convention of a foreign language, eg. Fortran. >> >> Will a memory layout cast in source really help with e.g. all >> Ada X Fortran combinations of compilers and libraries? >> It might disallow "knowledgeable" compilers to choose a layout >> that suits the current setup, I'd think, and require touching >> the sources. >> >> -- Georg >> > It will help with all Ada/ Fortran Compiler combinations, > but the really strange thing is that the memory layout is not > definitely known from the Ada source alone, if it is not specified > as being the layout as required by a different language, eg. Fortran. > If one wants to code a cache aware algorithm for eg. multiplying > matrices, one cannot do this in Ada without enforcing the memory > layout of the matrices via PRAGMA CONVENTION. I agree that it seems like an omission from the Ada standard. This specification should be available using a representation clause. Telling the compiler to allocate it like Fortran would is the wrong way, logically. Of course, if it really mattered, you could use an array of arrays (like in C), or an array of array accesses. With appropriate use of pragma representation you have a very high degree of flexibility. Compare this to almost any other language, and you find Ada leads the field by a mile. What if you wanted an alternative layout in Fortran? Or C++?
Sometimes I'd like to see some intermingling of row/column ordering, so a page in memory holds a square sub-array of a 2-D array. The approach I'll take (if I get around to it!) is to use a 2-D array of acceses, allocated (by a pool allocator) with the desired pattern. Moral of the story: One Size does not Fit All... -- Adrian ^ permalink raw reply [flat|nested] 68+ messages in thread
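The "array of array accesses" idea sketched above might look like this (all names hypothetical; a custom storage pool on Row_Access is what would actually control where each row lands, e.g. tile-interleaved):

```ada
procedure Tiled_Layout_Demo is
   N : constant := 256;

   type Row is array (1 .. N) of Float;
   type Row_Access is access Row;
   --  A possible 'for Row_Access'Storage_Pool use ...;' clause would
   --  direct row allocation to a pool with the desired placement.

   type Matrix is array (1 .. N) of Row_Access;

   M : Matrix;
begin
   --  Each row is allocated separately, so the allocator (or pool)
   --  decides the memory pattern rather than the array type itself.
   for I in M'Range loop
      M (I) := new Row'(others => 0.0);
   end loop;

   M (1) (1) := 42.0;  --  element access costs one extra indirection
end Tiled_Layout_Demo;
```

The trade-off, of course, is the extra indirection per row versus the freedom of placement.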
* Re: GNAT compiler switches and optimization 2006-10-23 12:40 ` Warner BRUNS 2006-10-23 13:52 ` Georg Bauhaus @ 2006-10-23 15:02 ` Robert A Duff 2006-10-23 20:22 ` Jeffrey R. Carter 1 sibling, 1 reply; 68+ messages in thread From: Robert A Duff @ 2006-10-23 15:02 UTC (permalink / raw) Warner BRUNS <Warner.Bruns@cern.ch> writes: > A problem with Ada is that the memory layout of multidimensional arrays > is not specified in the standard. If you want to have the same memory > layout for all Ada compilers, you have to specify that a multimensional > array should follow the convention of a foreign language, eg. Fortran. > TYPE DefReal IS DIGITS 12; > TYPE DefReal2D IS ARRAY(Integer RANGE <>, > Integer RANGE <>) OF DefReal; > PRAGMA CONVENTION (Fortran, DefReal2D); In practise, you can count on the fact that all Ada compilers will choose row-major order, unless you ask for column-major via the above pragma, even though the RM does not say so. > There is something missing in the Ada standard. Perhaps. - Bob ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 15:02 ` Robert A Duff @ 2006-10-23 20:22 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 20:22 UTC (permalink / raw) Robert A Duff wrote: > > In practise, you can count on the fact that all Ada compilers will > choose row-major order, unless you ask for column-major via the above > pragma, even though the RM does not say so. Because of the implementation advice in ARM 3.6.2.: "An implementation should normally represent multidimensional arrays in row-major order, consistent with the notation used for multidimensional array aggregates (see 4.3.3). However, if a pragma Convention(Fortran, ...) applies to a multidimensional array type, then column-major order should be used instead (see B.5, ``Interfacing with Fortran'')." -- Jeff Carter "Blessed are they who convert their neighbors' oxen, for they shall inhibit their girth." Monty Python's Life of Brian 83 ^ permalink raw reply [flat|nested] 68+ messages in thread
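Given that advice, matching loop order to layout is straightforward; a small sketch (type names and sizes are arbitrary):

```ada
procedure Layout_Demo is
   type Row_Major is array (1 .. 256, 1 .. 256) of Float;

   type Col_Major is array (1 .. 256, 1 .. 256) of Float;
   pragma Convention (Fortran, Col_Major);  --  column-major, per B.5

   R : Row_Major;
   F : Col_Major;
begin
   --  Row-major: vary the SECOND index innermost for unit stride.
   for I in R'Range (1) loop
      for J in R'Range (2) loop
         R (I, J) := 0.0;
      end loop;
   end loop;

   --  Column-major: vary the FIRST index innermost for unit stride.
   for J in F'Range (2) loop
      for I in F'Range (1) loop
         F (I, J) := 0.0;
      end loop;
   end loop;
end Layout_Demo;
```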
* Re: GNAT compiler switches and optimization 2006-10-21 16:35 ` Jeffrey Creem 2006-10-21 17:04 ` Pascal Obry @ 2006-10-21 18:28 ` tmoran 1 sibling, 0 replies; 68+ messages in thread From: tmoran @ 2006-10-21 18:28 UTC (permalink / raw) > I'd certainly be willing to run a few benchmarks but the important thing > here is that rather innocent looking code is running 2-4x slower than it > "should". But: > > for Ada: "tst_array 800" runs in 18 seconds > > for Fortran "tst_array 800" runs in 6 seconds > > > > (if I use the fortran "matmul" intrinsic the fortran time drops to > > 2.5 seconds) Clearly the right way to do large matrix multiplies is to call matmul. I'd be surprised if it made any significant difference whether it was called from Ada or from Fortran. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 7:37 ` Gautier 2006-10-21 16:35 ` Jeffrey Creem @ 2006-10-23 6:28 ` Martin Krischik 1 sibling, 0 replies; 68+ messages in thread From: Martin Krischik @ 2006-10-23 6:28 UTC (permalink / raw) Gautier schrieb: > Jeffrey Creem: > >> Note, I am the first one to jump to the defense of "Ada" in general >> but in this case, GNAT just plain sucks compared to GNU FORTRAN as it >> does a poor job on (at least) the inner loop (verified by looking at >> the output assembly) > > There is something strange... Martin Krischik was able to trim the > overall time for the Ada code down to 24% of the first version (GNAT/GCC > 4.1.1). > This should make the Ada program as fast as the FORTRAN one, shouldn't it ? > Maybe it's because the test is done on a 64 bit machine ? A 64 bit system doesn't need -march -mcpu and -msse2 as desperately. Try >gcc -dumpmachine i686-pc-cygwin For this system i686 is default. Martin ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-20 10:47 GNAT compiler switches and optimization tkrauss ` (11 preceding siblings ...) 2006-10-20 22:11 ` Jeffrey R. Carter @ 2006-10-21 12:39 ` Dr. Adrian Wrigley 12 siblings, 0 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-21 12:39 UTC (permalink / raw) On Fri, 20 Oct 2006 03:47:44 -0700, tkrauss wrote: > I'm a bit stuck trying to figure out how to coax more performance > out of some Ada code. I suspect there is something simple (like > compiler switches) but I'm missing it. As an example I'm using > a simple matrix multiply and comparing it to similar code in > Fortran. Unfortunately the Ada code takes 3-4 times as long. > > I'm using GNAT (GPL 2006) and GFortran (4.2.0) and the following > compile options: > > gnat make -O3 -gnatp tst_array > gfortran -O3 tst_array.f95 > > Running them on 800x800 matrices (on my 2GHz laptop) > > for Ada: "tst_array 800" runs in 18 seconds > for Fortran "tst_array 800" runs in 6 seconds > > (if I use the fortran "matmul" intrinsic the fortran time drops to > 2.5 seconds) > > Note, I tried reordering the loops, removing the random calls, etc. > none of which made huge changes. There is something killing > performance > and/or a switch or two that I'm missing, but I can't seem to find it. > Any > thoughts? When I started using Ada, I found exactly the same thing. Programs ran slower. Sometimes much slower. Perhaps this is the biggest single disadvantage of Ada (GNAT) in practice, particularly for heavy numerical codes (compared to Fortran or C). Looking closer, I found the assembly output was sometimes littered with apparently redundant or unintended function calls, extra checks and other baggage. The prevailing claims at the time were that Ada was roughly as fast as C, sometimes faster (because of deeper semantics). Whenever logically identical code was compared, however, the output assembly code was often identical, giving the same performance.
In real code, I found big differences when using enumerations instead of integers, multi-dimensional arrays etc. And the language-defined maths libraries, random number generators, I/O etc. all risked major slow-downs.

The only solution I found was to examine the output code, and modify the source until I got code that was acceptable. Sometimes this meant dropping useful language features, or using less clear constructs. This is a real problem with the language (Ada using GNAT). For a lot of applications, the optimisation effort isn't worth it. And even for performance-critical applications, most code is outside of any hot-spots.

I get the impression that Fortran is an excellent language for getting efficient code easily. C is also quite good. But C++ and Ada both seem to need "hand-holding" to keep them efficient. Perl is the worst I know(!)

If you really care about performance, you'll check the assembly code or compare against expectations. As you fix unexpected bottlenecks, you'll find out what type of code compiles well with your compiler, and write future code avoiding the problem areas. Of course, when the compiler changes, you may find the rules change and your old code appears quaint or idiosyncratic.

The other posters to this thread have given some useful optimisations to your original code. Let us know whether this bridges the performance gap for you!
--
Adrian Wrigley, Cambridge, UK.
Thread overview: 68+ messages (newest: 2006-10-25 15:32 UTC)

2006-10-20 10:47 GNAT compiler switches and optimization tkrauss
2006-10-20 11:04 ` Duncan Sands
2006-10-21 10:45 ` Stephen Leake
2006-10-20 11:42 ` Duncan Sands
2006-10-20 15:41 ` Martin Krischik
2006-10-20 12:09 ` Samuel Tardieu
2006-10-20 12:18 ` Samuel Tardieu
2006-10-20 12:12 ` Gautier
2006-10-20 12:35 ` Dmitry A. Kazakov
2006-10-20 15:53 ` Martin Krischik
2006-10-20 12:52 ` Gautier
2006-10-20 13:27 ` claude.simon
2006-10-20 15:38 ` Robert A Duff
2006-10-20 19:32 ` Gautier
2006-10-20 15:56 ` Jeffrey Creem
2006-10-20 16:30 ` Martin Krischik
2006-10-20 19:51 ` Gautier
2006-10-20 22:11 ` Jeffrey R. Carter
2006-10-20 23:52 ` Jeffrey Creem
2006-10-21 7:37 ` Gautier
2006-10-21 16:35 ` Jeffrey Creem
2006-10-21 17:04 ` Pascal Obry
2006-10-21 21:22 ` Jeffrey Creem
2006-10-22 3:03 ` Jeffrey Creem
2006-10-22 7:39 ` Jeffrey R. Carter
2006-10-22 11:48 ` tkrauss
2006-10-22 18:02 ` Georg Bauhaus
2006-10-22 18:24 ` Jeffrey Creem
2006-10-23 0:10 ` Georg Bauhaus
2006-10-22 20:20 ` Jeffrey R. Carter
2006-10-22 12:31 ` Gautier
2006-10-22 20:26 ` Jeffrey R. Carter
2006-10-22 21:22 ` Gautier
2006-10-22 18:01 ` tmoran
2006-10-22 20:54 ` Jeffrey R. Carter
2006-10-22 13:50 ` Alinabi
2006-10-22 15:41 ` Jeffrey Creem
2006-10-23 0:02 ` Alinabi
2006-10-23 5:28 ` Gautier
2006-10-23 16:32 ` Alinabi
2006-10-22 15:57 ` Jeffrey Creem
2006-10-22 19:32 ` Damien Carbonne
2006-10-22 20:00 ` Gautier
2006-10-22 20:51 ` Damien Carbonne
2006-10-23 2:15 ` Jeffrey Creem
2006-10-23 2:29 ` Jeffrey R. Carter
2006-10-23 1:31 ` Jeffrey Creem
2006-10-23 3:10 ` Jeffrey Creem
2006-10-23 7:31 ` Jeffrey R. Carter
2006-10-23 11:55 ` Jeffrey Creem
2006-10-23 19:52 ` Wiljan Derks
2006-10-23 20:25 ` Jeffrey R. Carter
2006-10-24 9:52 ` Dr. Adrian Wrigley
2006-10-24 11:50 ` Jeffrey Creem
2006-10-24 16:24 ` Jeffrey R. Carter
2006-10-25 3:50 ` Jeffrey Creem
2006-10-25 15:32 ` claude.simon
2006-10-24 19:21 ` Wiljan Derks
2006-10-23 12:33 ` Warner BRUNS
2006-10-23 12:40 ` Warner BRUNS
2006-10-23 13:52 ` Georg Bauhaus
2006-10-23 17:11 ` Warner BRUNS
2006-10-23 17:57 ` Dr. Adrian Wrigley
2006-10-23 15:02 ` Robert A Duff
2006-10-23 20:22 ` Jeffrey R. Carter
2006-10-21 18:28 ` tmoran
2006-10-23 6:28 ` Martin Krischik
2006-10-21 12:39 ` Dr. Adrian Wrigley